Helicopter transmission monitoring for predictive failure. The first successful use of a neural network to diagnose a previously 
unknown fault in an operational Navy aircraft was achieved recently. The basic concept behind this Office of Naval Research 
supported research is to use a model of the hippocampal (pertaining to a specific section of the brain) processes for recognition 
memory to train the model to recognize the normal range of vibration or noise signals from a mechanical device. See article 
on page 
connection strengths between individual cells are strengthened or weakened. This theory can be used to describe 
connection strengths (i.e. learning) in real brain cells or in abstract neural networks. These figures describe and predict 
Introduction 


Joel Davis, Guest Editor 
Office of Naval Research 


This issue of Naval Research Reviews represents the third 
opportunity for me to showcase the outstanding work being 
performed by some of our Office of Naval Research investi- 
gators in the area of neural networks. In 1985, I predicted in 
this journal [37 (4): 1985] that biologically motivated neural 
networks were shortly to be developed (“We are at the point 
where realistic and powerful models of cellular information 
processing networks underlying learning and memory can be 
developed”). Six years later in this journal [43 (4): 1991], I 
presented work from three laboratories well on their way to 
precise functional descriptions of computational neural proc- 
esses. 

In this edition, I have chosen ONR-funded contributors 
who are making the transition from research and development 
(R&D) to product. These are some of the first successful 
attempts to apply neural network algorithms to the solution of 
real world problems. Dan Hammerstrom, one of the contribu- 
tors to this volume, has eloquently written about the difficulty 
in hardware implementation of neural networks. Although 
neural networks are an extremely powerful set of techniques 
that are being extensively used in the real world and have 
become invaluable as a paradigm for research into human 
intelligence and biological computing structures, they have 
not yet revolutionized human/computer interfaces and intelli- 
gent computing. One response is, of course, that this field, if 
not in its infancy, is probably only in early adolescence. A 
second response is that neural networks currently being used 
constitute a part of a larger system. Optical Character Recog- 
nition (OCR) is a good example. Most commercial OCR 
systems use neural network classifiers, but other, more tradi- 
tional, computational networks are added to the system. These 
“hybrid” systems may represent the near term applications 
avenue. 

Another applications problem involves the analog vs. 
digital issue. More has been written on this topic than could 
be covered in this forum. Analog networks hold out the 
promise of ultra-low-power operation, which would ease 
downsizing and portability constraints. However, these systems 
are currently hard to mass produce and to make work reliably 
over a range of temperatures, voltages, and other operating 
conditions. I would like to suggest that these problems are solvable. 
In fact, ONR is taking the lead in examining these shortcom- 
ings of silicon analog systems. 

This issue of Naval Research Reviews features three arti- 
cles from contributors who have bridged the gap from basic 
research to technology applications. All of them have been 
participants in the ONR Small Business Initiative Program. 
Mr. Vincent Schaper and Mr. Doug Harry of the ONR Industrial 
Outreach Division should be commended for their vision 
and actions which have stimulated members of the small 
business community to take an active role in neural network 
applications. These young entrepreneurs are an integral part of 
transitioning this technology to the Navy and the marketplace. 

Another article describes a very thrilling Navy-relevant 
neural network application that may have already saved lives 
in an operational situation. The article describes the work of 
Professor Mark Gluck, currently at Rutgers University, and 
Mr. Robert Kolesar at Naval Command, Control and Ocean 
Surveillance Center in San Diego and the application of a 
biologically inspired algorithm to diagnose faults in a helicop- 
ter gearbox. The network has its roots in the basic research on 
animal and human learning performed by Dr. Gluck more than 
a decade ago as a post-doctoral student at Stanford supported 
by an ONR grant to his mentor. At first glance, the connection 
between animal learning and defective helicopter gear boxes 
may be difficult to make. However, a strong basic research 
program requires a strategy that is willing to see connections 
and take chances for the first time. 

Rear Admiral Marc Pelaez, Chief of Naval Research, has 
noted that “The Navy in recent times has taken a lot of hits for 
supporting such basic research as neural research ... if you were 
to just look at a research abstract, you might not understand 
why the Navy would be funding such work, but this neural net 
(Dr. Gluck’s novelty detector) is the result of exactly that sort 
of research which has come under question, and we’re still 
only elucidating a small piece of the potential applications.” 

The human brain and its functional abilities represent the 
ultimate in evolutionary processes that began millions of years 
ago. Understanding how learning and memory take place is 
one of the most important basic questions still to be answered 
in the life sciences. I believe that neural networks will allow 
us to organize our neuroscience knowledge, suggest new test- 
able hypotheses, and allow the fruits of that research to en- 
hance the Navy’s mission. I hope this issue of Naval Research 
Reviews conveys some of the current excitement in the neural 
network applications area. 





Mechanical Fault 
Detection System 


A neural network mechanical fault detection and classifi- 
cation technique developed by Mark A. Gluck, an assistant 
professor at the Center for Molecular and Behavioral Neuro- 
science, Rutgers University, Newark, recently proved its effec- 
tiveness when it confirmed a faulty transmission in a Marine 
Corps CH-46 helicopter. This project is funded by the Office 
of Naval Research (ONR). Gluck is a former recipient of 
ONR’s Young Investigator Award, which is given each year to 
a promising young researcher. 

When an unknown anomaly first appeared in data gathered 
on this presumed problem-free helicopter, initial analysis 
proved inconclusive. Gluck’s neural-based technique was put 
into operation and quickly and accurately indicated that a 
problem did exist. The helicopter, which had just participated 
in a major fleet amphibious exercise, was taken out of service 
and its aft transmission removed. An engineering analysis of 
the transmission revealed three possibly serious gear faults that 
had gone undetected by the aircrew. 

“These faults were significant enough to pull the helicop- 
ter out of operation to prevent any other problems that might 
have translated from the faults,” said Rear Admiral Marc 
Pelaez, Chief of Naval Research. “Although the neural-net was 
only in the testing stages, this maturing technology already has 
proven it’s capable of outperforming conventional methods of 
testing.” 

According to Pelaez, the Navy expects to have neural- 
based systems installed on a helicopter by this spring as part 
of the Air Vehicle Diagnostic System (AVDS) Advanced Tech- 
nology Demonstration to demonstrate their effectiveness as 
early warning systems. 

“This is a whole new approach to fault detection with 
tremendous commercial applications,” he noted. 

The fault detection and classification system analyzes 
vibrational signals given off by helicopter gearboxes to determine 
the health of the component. An important feature of this 
neural-based diagnostic technique is the network developed by 
Gluck, which is modeled on the hippocampus (a specific section 
of the brain). Gluck’s neural-based network will, for the first 
time, allow characterization of mechanical faults that were 
heretofore unknown. This feature is called “novelty detection.” 

Gluck’s hippocampal-based network has also been used 
to detect and classify faults in aircraft carrier fire pumps, 
as well as sonar signals. 

Besides its potential to save lives through early and 
accurate detection and classification of faults, the neural-based 
system is also expected to significantly reduce operations and 
maintenance costs for the Navy and other potential users, said 
Pelaez. Instead of overhauling equipment on a “time-based” 
schedule, the neural-based system should allow for 
“condition-based” maintenance, whereby machines are repaired or 
reconditioned only when there is objective evidence of failure, he 
said. 

Development of the neural-based detection and classification 
technologies incorporating Gluck’s hippocampal algorithms 
has taken place at the Naval Command, Control and 
Ocean Surveillance Center in San Diego, under the direction 
of Robert Kolesar, Head of the Advanced Development Group. 

The hippocampal-based technology developed by Gluck 
consists of computer programs using the basic hippocampal 
learning process common to humans and other animals. Key 
to this system is an essential component of memory formation 
and recognition: using past experience and information to 
determine the relative value of new and different cues and 
inputs. 

For example, before flight the hippocampal network is 
taught to learn the vibrational patterns from “good” gearboxes 
to determine a range of normal operations. Once the system 
has been trained on a wide variety of “good” data, the hippo- 
campal model is brought into play to detect and classify 
anomalous indications falling outside of the range of “good- 
ness.” The system works by assigning values (numbers) to 
various inputs based upon whether they fall into the range 
considered normal. 
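
The train-on-good-data, then flag-what-falls-outside procedure described above can be illustrated with a much simpler stand-in than the hippocampal model, whose internals are not given in this article. The Python sketch below learns a per-feature normal range from "good" vibration feature vectors and scores a new vector by how far it falls outside that range. The feature values, the function names (fit_normal_range, novelty_score, is_anomalous) and the threshold are all hypothetical choices made for the example.

```python
from statistics import mean, stdev

def fit_normal_range(good_vectors):
    """Learn a per-feature (mean, standard deviation) pair from 'good' data."""
    columns = list(zip(*good_vectors))
    return [(mean(c), stdev(c)) for c in columns]

def novelty_score(model, vector):
    """Largest per-feature deviation from the learned mean, in standard deviations."""
    return max(abs(x - m) / max(s, 1e-12) for (m, s), x in zip(model, vector))

def is_anomalous(model, vector, threshold=4.0):
    """Flag a vector whose score exceeds the (assumed) threshold."""
    return novelty_score(model, vector) > threshold

# Hypothetical vibration features, e.g. signal energy in three frequency bands.
good = [
    [1.00, 0.50, 0.20],
    [1.10, 0.55, 0.22],
    [0.90, 0.45, 0.18],
    [1.05, 0.52, 0.21],
    [0.95, 0.48, 0.19],
]
model = fit_normal_range(good)
print(is_anomalous(model, [1.02, 0.51, 0.20]))  # close to the training data
print(is_anomalous(model, [1.00, 0.50, 0.60]))  # third band far outside the range
```

A statistical range check like this only captures the general idea of novelty detection; it makes no claim about how the network described in the article represents "goodness."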

“The system, in essence, comes to learn what a normal 
gearbox sounds like, so it can then identify an abnormal 
vibrational pattern when it is inputted,” explained Gluck. “It’s 
the same principle as someone knowing what his car engine 
sounds like, and being able to identify when something is 
wrong because the engine sounds different. No one else may 
be able to hear the difference, but the owner of the car has come 
to ‘learn’ what sounds normal, and is able to determine when 
something is wrong.” 

Unlike the human mind, which can follow only about four 
to five streams of information at a time when analyzing com- 
plex problems, the hippocampal-based network is capable of 
handling streams of information numbering in the hundreds, 
providing the potential for analyses previously outside the 
range of human ability and knowledge. Besides its applica- 
tions in mechanical fault determination, the hippocampal- 
based network holds the potential of being utilized in such 
fields as medical diagnoses and economic forecasting. 

The hippocampal-based network has its roots in the basic 
research on animal and human learning performed by Gluck 
more than a decade ago as a post-doctoral student at Stanford 
University. In his initial research, which was supported by 
ONR funding, Gluck worked with rabbits to see if they could 
be taught to learn how to blink their eyes according to specific 
patterns, and how that learning took place. From there, his 
research moved to working with individuals who suffered 
from amnesia and hippocampal damage to determine more 
clearly the hippocampal-based processes involved in memory 
formation and recognition. 

In addition to the development of neural-based systems 
for analyses and detection, Gluck’s research also paves the 
way for improved understanding of both normal and abnormal 
functions of the human mind. For example, it may become 
possible to use his research for the design of learning tasks that 
would pinpoint the location of damage to the brain that is the 
source of different types of learning difficulties. With these 
sorts of indicators in hand, it then may become possible to 
develop better diagnostic and treatment methods to assist those 
who suffer from mental disorders occurring as a result of 
such problems as Alzheimer’s disease, schizophrenia, strokes, 
and traumatic brain injuries. 
Neural Network 
Applications Using the 
Ni1000 Recognition 
Accelerator 


Michael Glier and Mark Laird, Nestor, Inc. and Dr. Leon Cooper, Brown University 


Editor’s Note 


The fastest neural network processor, the Ni1000 Recognition Accelerator (computer chip), was developed recently by Intel 
Corp. of Santa Clara, California, and Nestor Corp. of Providence, Rhode Island, with Office of Naval Research and Advanced 
Research Projects Agency (ARPA) funding. The new technology is the most promising approach toward building intelligent 
machines that mimic hearing, seeing and thinking. The chip is amazingly quick at recognizing handwriting, identifying military 
targets and performing other tasks that are difficult or impossible for conventional chips. There are numerous civilian applications 
for this chip, including fingerprint identification, automatic mailing address processing, and even stock market forecasting. 

These chips work more like the human brain than the microprocessors used in millions of personal computers. Because they 
can recognize visual or sound patterns at high speed, neural nets are being applied to tricky tasks such as distinguishing human 
voices and zip codes. ARPA is interested in these chips for identifying submarines and other targets. 

Nestor developed a version of the handwriting-recognition algorithm for the Ni1000. A scanner based on a fast version of 
Intel’s 486 microchip can recognize about 30 handwritten characters per second, while the Ni1000 is expected to recognize 5,000 
to 10,000 characters per second. 

Where other chips answer precise mathematical questions, neural net chips can be trained to work on more subjective 
problems. Interconnected processing elements on each chip, called neurons, join in different ways when exposed to different 
signals. By employing a large number of processing elements that operate in parallel, the Ni1000 performs 25 billion 
interconnection operations per second. The chip uses a large block of flash memory so that learned patterns can be “memorized” 
Introduction 


For many years computation has been dominated by 
considerations of speed and efficiency since complex, time- 
consuming calculations had to be accomplished with limited 
means. We now have to adjust to an era of plenty, in which 
vastly increased memory and processing power makes use and 
convenient access more important than speed and raw compu- 
tational power. 

In a curiously parallel fashion, neural networks have also 
been dominated by computational constraints; in some of these 
networks, learning is very slow and in all of them the intrinsi- 
cally parallel computations are not efficiently executed on 
serial machines. This time of constraint may now be ending. 

Recently, Intel and Nestor Corporations, under Advanced 
Research Projects Agency (ARPA) and Office of Naval Research 
(ONR) contracts, have produced the Ni1000, a 3.7 
million transistor VLSI chip that has 1024 sparsely connected 
neurons of 256-input dimension with 64 outputs. This chip has 
an on-board RISC processor and on-chip learning. It can 
perform up to 16 billion integer operations or about 33,000 
classifications per second and is expected to provide real-time 
performance in various military and commercial applications. 

Since many Ni1000 chips can run in parallel and since 
future generations of hardware will no doubt increase the 
number of neurons while decreasing power needs and cost, we 
are entering a time of hardware plenty for neural networks. 
Rather than struggling to make neural networks small and less 
complex, and rather than conserving the number of neurons 
employed, we may in the future be primarily concerned with 
ease of use, accuracy and ability to generalize. 

Biology provides us with an example, the brain. With this 
instrument, we manage with stunning success (at least occa- 
sionally) to recognize patterns and make rapid, if sometimes 
incorrect, decisions in complex real-world situations. 


The Ni1000 
Recognition Accelerator 


The Ni1000 recognition accelerator is a high performance 
radial basis function (RBF) neural network chip. The chip 
offers the ability to accelerate neural network applications with 
performance equivalent to up to 16.5 billion operations on a 
general purpose microprocessor. 

The Ni1000 design is shown in Figure 1. It consists of a 
parallel, pipelined radial basis function neural network, made 
up of 3 independent functional units: 

• A radial basis function neural network classifier 

• A 16-bit RISC microcontroller 

• A bus interface 

Figure 1. Block diagram of Ni1000 Recognition Accelerator Chip 


The Radial Basis Function 
Neural Network Classifier 


The classifier contains: 

• 512 parallel distance calculation units 

• A prototype array memory, which can store up to 1024 prototype vectors in on-chip FLASH memory 

• A 6-stage pipelined math unit 

The 1024 prototype by 256 feature array memory is 
organized such that two prototype vectors are associated with, and 
physically adjacent to, one of the 512 distance calculation units 
(DCU). Each of the DCUs first calculates the distance between 
the input feature vector and one of the DCU’s two local 
prototype vectors. This distance is transferred to the math unit. 
Then the DCU calculates the distance between the input feature 
vector and its other local prototype vector, if necessary. 
This second operation does not occur if fewer than 500 prototypes 
are in use. Instead, the next vector begins processing, 
which increases throughput. 

Figure 2. Ni1000 Data Flow. 

The math unit produces deterministic results (a list of 
firing classes) and 16-bit floating point probabilistic results 
(6-bit exponent, 10-bit mantissa) in parallel. It has a 6K RAM 
for storing two full sets of results for all 64 classes, both 
deterministic and probabilistic. The math unit can accept one 
new distance on every clock. 
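
The data flow described above, in which distance units compare the input with stored prototypes and the math unit reports a list of firing classes, can be sketched in software. The Python model below is illustrative only and is not the chip's microcode: the city-block (L1) distance metric, the per-prototype radius field and all names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Prototype:
    vector: list   # stored feature values
    radius: int    # assumed size of the prototype's influence field
    label: str     # class that this prototype votes for

def l1_distance(a, b):
    """City-block distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def firing_classes(prototypes, features):
    """Return each class with at least one prototype whose field contains the input."""
    fired = []
    for p in prototypes:
        if l1_distance(p.vector, features) <= p.radius and p.label not in fired:
            fired.append(p.label)
    return fired

# Hypothetical prototypes with 5-bit feature values (0 to 31).
prototypes = [
    Prototype([2, 3, 30], radius=5, label="A"),
    Prototype([20, 25, 4], radius=6, label="B"),
]
print(firing_classes(prototypes, [3, 4, 29]))    # ['A']
print(firing_classes(prototypes, [10, 10, 10]))  # []
```

On the chip these distances are computed by many units at once; the loop here is serial purely for readability.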


A 16-bit RISC Microcontroller 


The on-chip, custom, 16-bit RISC microcontroller has 
separate program and data memories. The 4K x 16-bit non- 
volatile FLASH EPROM memory can hold training algo- 
rithms, chip maintenance utilities and other software required 
by the application. A general purpose 256 x 16-bit RAM is also 
accessible to the microcontroller. A set of standard microcode 
is provided with the chip that implements a simple race-free 
interface protocol and two training algorithms, Restricted 
Coulomb Energy (RCE) and Probabilistic Neural Network 
(PNN). 
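
For readers unfamiliar with RCE, the following Python sketch shows the general idea as it is commonly described in the neural network literature: a new prototype is committed when no prototype of the correct class covers a training pattern, and the influence field of any wrong-class prototype that covers the pattern is shrunk. It is a simplified illustration under stated assumptions (L1 distance, integer radii, hypothetical parameter names), not the standard microcode supplied with the chip.

```python
def l1(a, b):
    """City-block distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def train_rce(samples, max_radius=10, min_radius=1, epochs=10):
    """samples: list of (feature_vector, label). Returns a list of prototype dicts."""
    protos = []
    for _ in range(epochs):
        changed = False
        for vec, label in samples:
            covered = False
            for p in protos:
                d = l1(p["vector"], vec)
                if d <= p["radius"]:
                    if p["label"] == label:
                        covered = True
                    elif p["radius"] > min_radius:
                        # A wrong-class prototype fired: shrink its field.
                        p["radius"] = max(min_radius, d - 1)
                        changed = True
            if not covered:
                # Commit a new prototype, limited by the nearest rival prototype.
                rivals = [l1(p["vector"], vec) for p in protos if p["label"] != label]
                radius = min([max_radius] + [d - 1 for d in rivals])
                protos.append({"vector": list(vec),
                               "radius": max(min_radius, radius),
                               "label": label})
                changed = True
        if not changed:
            break  # a full pass made no changes, so training has settled
    return protos

# Hypothetical two-class training set.
samples = [([2, 2], "A"), ([3, 3], "A"), ([20, 20], "B"), ([8, 8], "B")]
protos = train_rce(samples)
for p in protos:
    print(p)
```

Training stops after a pass with no changes, or after the assumed epoch limit, whichever comes first.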

It is also possible to do all of the training external to the 
chip, then load the trained network parameters into the Ni1000. 
Functions are provided in the standard microcode to facilitate 
loading data into the chip. Functions are also provided to copy 
data from the chip, which can be used to replicate the trained 
network in other chips. 


The Bus Interface 


The Ni1000 bus interface provides double buffers on the 
input to permit pipelining of input vectors. An external master 
can access the chip’s I/O registers to control and monitor the 
classifier, and to communicate with the microcontroller. 


Operation 


At 33 MHz, the Ni1000 can classify over 32,000 input 
vectors per second, where each input vector has up to 256 
features, each with 5-bit resolution. This performance level is 
made possible by the Ni1000’s parallel architecture, which 
executes up to 16.5 billion operations per second. A typical 
Von Neumann machine would need to execute more than 65 
billion instructions per second to approach the processing rate 
achieved by the Ni1000 Recognition Accelerator. 
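
The quoted rates can be cross-checked with simple arithmetic. The short Python calculation below uses only figures given in this article (33 MHz clock, 32,000 classifications per second, a 1024 prototype by 256 feature array); the variable names are arbitrary.

```python
CLOCK_HZ = 33_000_000        # 33 MHz clock
VECTORS_PER_SECOND = 32_000  # quoted classification rate
PROTOTYPES = 1024            # prototype capacity of the array memory
FEATURES = 256               # feature dimension of the array memory

# Clock cycles available for each classification.
cycles_per_vector = CLOCK_HZ / VECTORS_PER_SECOND

# Prototype-feature comparisons needed per second when the array is full.
comparisons_per_second = VECTORS_PER_SECOND * PROTOTYPES * FEATURES

print(f"{cycles_per_vector:.0f} clock cycles per classification")
print(f"{comparisons_per_second / 1e9:.1f} billion prototype-feature comparisons per second")
```

If each comparison is counted as roughly two elementary operations (for example a subtraction and an accumulation), the result lands near the quoted 16.5 billion operations per second. That way of counting is an assumption, not something the article states.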

The Ni1000 data flow is shown in Figure 2. The Ni1000 
supports all neural network learning on-chip, via an embedded 
microcontroller. Because the chip includes this embedded 
microcontroller, the user does not need to be a neural network 
expert to implement a complete neural network system. Train- 
ing can be accomplished by simply presenting the patterns and 
their corresponding class labels to the chip. 

Separate dual-input data buffers and a single output buffer 
are provided. This permits simultaneous pipelined operation 
on up to three input patterns. The output buffer provides 
several output data formats to support various application 
requirements, including integer and single-precision IEEE 
floating point. 

Since prototypes are stored in nonvolatile FLASH mem- 
ory, no off-chip prototype memory or performance-stealing 
prototype loading operations are needed. The chip stores ap- 
proximately 6 Kbytes of prototype parameter data in its on- 
chip RAM. This is stored in RAM since it must change during 
the training process. However, once the training process is 
completed, this data can be made nonvolatile by copying it into 
reserved FLASH. A microcontroller firmware routine is then 
used to copy the FLASH data into the prototype parameter 
RAM each time the chip is powered up. 

The hardware overhead required to incorporate the 
chip into a system is further reduced by the fact that no 
external boot or program memory is required, since the internal 
microcontroller’s program memory is also FLASH memory. 

The in-circuit reprogramming of the Ni1000’s FLASH 
memory also provides a mechanism for enhancing the chip’s 
learning algorithms in the field, reducing the cost of upgrades 
and maintenance, and preserving the customer’s investment. 


Application #1: Optical 
character recognition 


Several manufacturers of document image processing 
equipment are developing new, low cost document readers 
using the Ni1000. Nestor itself is developing a high-speed 
image processing and recognition system for high-volume 
“smart-forms” transaction processing. Called NiReader, it will 
feature Ni1000 chips as its core recognition processing elements. 

Such products are expected to dramatically improve both 
the speed and the accuracy of the processing of hand-printed 
forms, significantly reducing the cost of paper work for busi- 
ness and government. For example, in health-care systems 
smart forms are beginning to eliminate administrative bottle- 
necks, dramatically reducing the time required for the collec- 
tion of patient information and for the payment to providers. 


Application #2: Mail Sorting 


A manufacturer of mail sorting equipment has developed 
a low-cost envelope reader which reads the destination zip 
code from the printed address field on an envelope, then prints 
a postal bar code on the envelope. Businesses can save up to 
10 cents per item when mailing bulk mail by using bar coding 
techniques. The Ni1000 permits this reader to process mail at 
up to 50,000 pieces per hour, dramatically reducing bulk 
mailing costs. 

Initial testing on one bulk mailing achieved 99.7% accu- 
racy on Canadian (alphanumeric) zip codes. Although the 
board’s interface could not support the required data rate, the 
theoretical classification rate for the Ni1000 on this problem 
was 143,000 characters per second. 


Application #3: Intelligent 
Forms Processing (N’Form) 


Nestor’s N’Form is a document identification and routing 
product based on the Ni1000. N’Form features robust and 
accurate identification of document form types, including 
multi-part forms, without any required pre-processing to cor- 
rect for skew or document rotation. The prototype version 
already exhibits high accuracy forms recognition of dozens of 
form types under difficult conditions, including 180° rotation, 
notes in margins, significantly defaced forms and even 
stick-on notes. 

N’Form will deliver important benefits to high volume 
document scanning operations, including an estimated 4x 
increase in scanner operator productivity, support for fast 
scanning job changes, and fully integrated forms processing. 
The latter includes Intelligent Character Recognition (ICR, 
hand printed), Optical Character Recognition (OCR, machine 
printed), Optical Mark Recognition (OMR) and bar code 
recognition, plus automated document indexing and routing. 

N’Form will automatically perform key image quality 
assessment functions, including orientation detection (for cor- 
rection of documents imaged upside down or rotated), blank 
form detection (to identify forms that have not been filled out) 
and blank page detection (to correctly detect the true form 
presented as a sequence of two images from a scanner operat- 
ing in duplex mode). Single or multi-part documents can be 
routed either to a user-specified directory associated with the 
form type, or to another application for next stage processing. 

A separate Document Administrator module will support 
forms training capability directly from images of sample 
forms. In addition to initial training, this permits training on 
new forms and on images that are rejected by the system. 

The product, targeted for availability by mid-1996, will 
support real-time forms identification operating at 120 im- 
ages/minute from uncompressed or TIFF G3, 2d or G4 com- 
pressed images. 


Application #4: 
Traffic Monitoring 


Vehicle detection systems using video technology are 
currently being deployed. Neural networks can improve on 
these systems through more robust detection (e.g. bicycles, 
pedestrians and emergency vehicles) and an ability to classify 
vehicles (or other objects). Figure 3 shows the operation of a 
traditional video-based vehicle detection system. 

Figure 3. Operation of a typical video detection system. 

A system using vehicle classification can track the vehicle 
(or object) from camera to camera. Traffic engineers and traffic 
control systems can use this information for intelligent corridor 
management, and law enforcement can track vehicle 
movement to avoid high speed chases. 

Figure 4 illustrates the role of the neural network in the 
detection system. The trained network indicates the position 
of the vehicle in the image by producing a Gaussian shaped 
curve (Figure 4) where the peak of the curve indicates the 
longitudinal center of the vehicle within the detection zone. 
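
Reading a position off such an output curve amounts to finding its peak. The Python sketch below generates a hypothetical Gaussian-shaped response over the positions of a detection zone and returns the index of the maximum; the zone length, curve width, minimum peak level and function names are assumptions for illustration, since the article does not give the network's actual output format.

```python
import math

def gaussian_response(length, center, width=2.0):
    """Hypothetical network output: one value per position along the detection zone."""
    return [math.exp(-((i - center) ** 2) / (2 * width ** 2)) for i in range(length)]

def locate_vehicle(response, min_peak=0.5):
    """Return the index of the peak, or None if no output is strong enough."""
    peak_index = max(range(len(response)), key=response.__getitem__)
    if response[peak_index] < min_peak:
        return None
    return peak_index

output = gaussian_response(length=20, center=12)
print(locate_vehicle(output))        # 12
print(locate_vehicle([0.05] * 20))   # None
```

The minimum peak level keeps a weak, flat response from being reported as a vehicle.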

The Ni1000 can support video data classification from up 
to 4 standard CCTV cameras at 30 frames/second. This information 
processing rate permits the Ni1000 to capture and 
classify multiple frames containing the same vehicle to obtain 
a high confidence classification. 
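
One simple way to combine several per-frame results into a single higher-confidence answer is a majority vote; at 4 cameras and 30 frames/second there are up to 120 frames per second to draw on. The Python sketch below shows the idea. The voting rule, the min_share parameter and the labels are hypothetical, and the article does not say which combination method is actually used.

```python
from collections import Counter

def vote(frame_labels, min_share=0.6):
    """Majority vote over per-frame labels; None if no label is dominant enough."""
    if not frame_labels:
        return None
    label, count = Counter(frame_labels).most_common(1)[0]
    return label if count / len(frame_labels) >= min_share else None

frames = ["truck", "truck", "car", "truck", "truck"]
print(vote(frames))            # truck
print(vote(["car", "truck"]))  # None
```

Returning None when no label reaches the required share lets the caller wait for more frames instead of committing to a weak result.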

In recent tests at a live traffic signal installation at Louisiana 
State University’s Remote Sensing Laboratory, the Ni1000 
achieved 92% accuracy when used as a vehicle presence and 
passage detector in day, night and dawn/dusk transition condi- 
tions. For comparison, an existing highly-optimized and dedi- 
cated image processing system has up to 95% accuracy under 
full daylight or night time conditions, but exhibits significantly 
degraded accuracy during dusk and dawn transition periods, 
when traffic is often at its peak. 


Application #5: Vision systems 


Neural-network-based visual inspection systems can 
quickly “learn” to recognize flaws in new products (or new 
versions of existing products), without the need for specialized 
programming or time-consuming system adjustments. Quality 
assurance can be fine-tuned, and instantly adjusted as products 
and markets evolve. Neural-network-based intelligent inspection 
systems are in use today in pharmaceutical, automobile 
and even potato chip manufacturing. 

Figure 4. Detection model architecture. 


Application #6: 
Fingerprint Matching 


Current fingerprint identification methods use high-cost 
mainframe-based systems that are too large and expensive to 
deploy in local police departments. As a result, automated 
fingerprint identification can involve long time delays - as long 
as one week for certain FBI files. In addition, their accuracy, 
when matching partial fingerprints, is unsatisfactory. 

One manufacturer has developed an Ni1000-based finger- 
print matching system that promises to move this capability to 
local police departments. This will allow rapid identification 
of suspects, reducing the time that known criminals remain 
unidentified. 

These Ni1000-based systems will allow local law enforcement 
officials rapid analysis, based on readily available digitized 
fingerprint databases, and will also provide improved 
accuracy from matches of partial prints, all at a fraction of the 
cost of existing systems. A similar device for digitized photo 
matching is also being developed. 


Ni1000 Specifications 


The Ni1000 comes in a 168-pin CPGA and can operate 
from 10 MHz to 33 MHz in the commercial temperature range. 
The part can also operate over an extended temperature range, 
with some reduction in maximum operating frequency. When 
running at 33 MHz, it dissipates 3 watts at 5V, and requires 30 
mA of +12V during programming or erasure of the FLASH 
memories. Its I/O signal levels are TTL compatible. 

At 33 MHz, it classifies a minimum of 32,000 patterns per 
second when employing all available input features and prototypes. 
Smaller classification problems can be run substantially 
faster. In fact, some real applications that have been run 
on the Ni1000 execute at well over 100,000 patterns per 
second. 

The chip itself contains over 3.7 million transistors on a 
13mm x 15mm die. It is fabricated by Intel using their 0.8 µm 
CHMOS-IV process. 


Ni1000-Based Products 


A development system is available for developing solutions 
using the Ni1000. ISA, PCI and VMEbus compatible 
boards are currently available and Ni1000s can be purchased 
separately for custom board designs. 
Summary 


Neural networks have evolved into high performance, 
low-cost and easy-to-incorporate solutions. Using the Ni1000, 
they can deliver capabilities formerly limited to supercomput- 
ers, on a PC or workstation. 

Applications developers are rapidly learning about the 
capabilities of neural networks, and are beginning to employ 
them in hundreds of applications. The dramatic success stories 
of Ni1000 customers prove the wide applicability of this 
technology. 

Integration times (often only a few days to a few weeks, 
when replacing other software or hardware neural networks) 
demonstrate the ease of integration of the hardware and soft- 
ware. 
Neural Network 
Applications Speed The 
Navy’s Warfighting Ability 


A.J. Maren, R.M. Pap, K.L. Priddy, and R. M. Akita 
Accurate Automation Corporation 


Editor’s Note 


This article describes the successful results of two Navy programs working together. First, funding came from the 
Office of Naval Research's Cognitive and Neural Science and Technology Division; these funds were then channeled through 
the Navy's Small Business Innovation Research (SBIR) program to Accurate Automation Corporation (AAC), a small 
business enterprise in Chattanooga, Tennessee, which has successfully developed novel neural network technologies for 
the Navy. With the help of the SBIR program, AAC has grown from a basement shop of two people in 1987 to 20 people 
today in its own building of 10,000 square feet; revenues have grown during the same time from $18,000 to $4,000,000. 

Among other technologies for the Navy, AAC has developed the “Neural Network Toolbox” and the Sparse MIMD Neural 
Network Processor, which will make a PC work as fast as a supercomputer. 

Mentioned in this article are the three phases of funding managed by SBIR to promote small business. In phase I, up 
to $100,000 is awarded to a small business for 6 months to evaluate the technical merit and feasibility of an idea; phase 
II awards up to $750,000 for two years to expand the results of phase I by developing a product; and phase III allows for 
the commercialization of the product through a “buyer” in the government or private industry. 

Thus by choosing small companies with the right capabilities for the job, the Navy can improve the scientific as well 
as the business strength of the nation. 
Introduction 


There’s an old adage amongst fighter pilots: “Speed is life, 
more is better.” These pilots spoke only in reference to their 
platforms. In today’s world, the need for speed applies to all 
areas of warfighting. 

The German army, during the 1940’s, mastered the art of 
“Blitzkrieg,” or “lightning warfare,” which referred to the 
technique of rapidly building for and executing an attack. 
Today, all areas of warfare demand speed and precision. This 
applies not only to the abilities of the forces in direct contact, 
but also - and most especially - to functions that support the 
warrior in combat, including sensor fusion, data communica- 
tions, database management, target and image recognition, and 
rapid surveillance. These capabilities directly influence our 
ability to carry the battle to the enemy, and are the critical 
elements in our ability to strike an early, crippling blow. 

According to Adm. William A. Owens, Vice Chairman of 
the Joint Chiefs of Staff, “Read the flagship pronouncements 
of each of the military services: The Army’s description of 
Force XXI, the Navy’s Forward ... from the Sea, the Air Force’s 
Global Reach, Global Power, and the Marine Corps’ Opera- 
tional Maneuver ... from the Sea. The visions they sketch are 
remarkably similar. Each points toward the capability to use 
military force with greater precision, less risk, and more effec- 
tiveness. Each relies on three areas of technology: 


• Intelligence, surveillance, and reconnaissance, 

• Advanced command, control, communications, 
computers, and intelligence, and 

• Precision-guided munitions. 

Each recognizes that its efforts are a part of a broader 
undertaking. I believe that this is the U.S. revolution in joint 
military affairs.” 

Neural networks are a key enabling technology that will 
enhance all three of these critical technologies. Today’s battle- 
field, whether on air, land, or sea, uses a number of neural 
network hardware and/or software applications. For example, 
all modems make use of neural network technology for adap- 
tive echo cancellation [1]. Many of the neural network inno- 
vations are inspired by biological neural networks. The 
diversified interplay between different neural network re- 
search programs has led to a strong basis for technology 
development and transition to fielded use by the Navy. 

To meet these Navy needs, we have been developing 
novel neural network technologies that will support and speed 
the Navy’s warfighting capabilities on many levels. These 
technology developments, sponsored by the Navy Small Busi- 
ness Innovation Research (SBIR) program, include neural 
adaptive control (leading to advanced flight control methods), 
sensor fusion and figure-of-merit determination as well as data 
compression and automatic target recognition. Neural 
networks apply to Militarily Critical Technologies [2] such as 
sensor fusion and signal processing, hypersonic / waverider 
aircraft design and control, simulation / visualization methods, 
and intelligent processing equipment. In the latter area, in 
order to develop a platform that can give the Navy maximal 
computational speed, Accurate Automation Corporation 
(AAC) has developed a Multiple Instruction, Multiple Data 
(MIMD) Neural Network Processor (NNP™). 

Figure 1. Overall controller system design. 

The following is a summary of AAC’s recent develop- 
ments in neural network technology for diversified Navy 
applications. 


Neural Adaptive Control 


An autonomous control system is one in which the system 
itself generates the appropriate control action and/or trajectory. 
An autonomous controller, whether used in robotics or for 
flight control, should use desired position values in relation to 
current position to determine the appropriate motion within a 
range that is dictated by the capabilities of the plant. Adaptive 
control systems are those systems that change their own para- 
metric structure to compensate for changes in the plant being 
controlled. 

As part of ongoing work in adaptive control, AAC has 
developed novel neural network methods for inverse kinemat- 
ics determination, under a Phase II SBIR contract funded by 
the Office of Naval Research (ONR). Our Neural Network 
Processor has been used to solve the inverse kinematics 
problem in near-real-time. These solutions were tested on a real 
robot using a VME card cage and a dual DSP (TI TMS320C40) 
implementation attached to the Neural Network 
Processor. The tests were run at NASA Marshall Space Flight 
Center, AL, using the Proto Flight Manipulator Arm, or PFM. 
In addition to the inverse kinematics using neural networks, a 
unique joint controller [3] was developed using the functional 
link neural network paradigm. The overall concept of this is 
shown in Figure 1. 

The inverse kinematics problem can be expressed as a set 
of coupled equations relating the joint parameters to the 
desired end effector trajectory. Thus, by solving this set of 
coupled equations we can determine the joint moves needed 
to obtain the desired trajectory. Recurrent networks have been 
used to solve optimization problems in a variety of 
applications [4-8]. We will show how the solution of sets 
of equations using a linear Hopfield network can also be 
modified to solve the inverse kinematics problem. 

Classical manipulator kinematics describes the position 
of an end effector based upon the positions of the joints of a 
robot arm. Generally, this is given by a set of non-linear 
equations, $f(\cdot)$, in joint space, $\theta(t)$: 

$$ y(t) = f(\theta(t)) \tag{1} $$ 


What we really want to know in the inverse kinematics 
case are the desired joint positions, $\theta(t)$, to obtain $y(t)$. The 
inversion of $f(\cdot)$ is difficult due to the dimensionality, multiple 
joints, and the inherent non-linearities found in robot motors. 

$$ \theta(t) = f^{-1}(y(t)) \tag{2} $$ 


A common method to facilitate the solution of the inverse 
kinematics problem is to linearize the forward kinematics, i.e. 
to differentiate $f(\cdot)$ with respect to time, yielding the velocity 
equation (or differential kinematics) 

$$ \frac{dy}{dt} = J(\theta)\,\frac{d\theta}{dt} \tag{3} $$ 

where $J(\theta)$ is the Jacobian of $f(\cdot)$ evaluated at $\theta$. 

The prescribed trajectory $y(t)$ is then tracked by a linear 
approximation via inversion of the linear velocity equation and 
integrating the obtained joint angle velocities. 

Using this approach, the inverse kinematics problem 
reduces to the inversion of the Jacobian matrix $J[\theta(t)]$ at all time 
instants along $y(t)$, yielding 

$$ \frac{d\theta}{dt} = J^{-1}(\theta)\,\frac{dy}{dt} \tag{4} $$ 


The solution for the inversion of the Jacobian is obtained 
using a combination of a feedforward neural network and a 
linear Hopfield network. The Hopfield neural network solves 
an equation of the form 


$$ x(i+1) = W x(i) + u \tag{5} $$ 

where 

W is a symmetric (n x n) real or complex weight matrix, 

u is a real or complex n-vector of inputs, and 

x(i) is the ith iteration of the n-dimensional vector of neuron states. 

The resulting equation converges to a solution when the 
spectral radius, $\rho(W)$, is less than unity, i.e. the eigenvalues of 
W lie within the unit circle in the complex plane. When solving 
equations of the form Ax = b, the required finesse is to set the 
weight matrix and input for the Hopfield network to 

$$ W = I - \alpha A^{H} A \tag{6} $$ 

$$ u = \alpha A^{H} b \tag{7} $$ 

where $A^{H}$ is the Hermitian (complex conjugate transpose) of A. 
These equations converge for 

$$ 0 < \alpha < \frac{2}{\rho(A^{H}A)} \tag{8} $$ 

In order to speed computations the spectral radius is often 
replaced by the trace, which is an upper bound for the spectral 
radius: 

$$ \alpha < \frac{2}{\operatorname{trace}(A^{H}A)} \tag{9} $$ 

The neural network implementation of the linear Hopfield 
solution for a system of equations of the form Ax = b is shown 
in Figure 2. 

Figure 2. A hybrid linear dynamic network, comprised of a 
feedforward layer followed by a linear Hopfield network. 
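
The linear Hopfield iteration for Ax = b can be tried out in a few lines of software. The following is a minimal sketch only, assuming a real-valued A (so the Hermitian transpose is the ordinary transpose) and a step size of 1/trace, which lies inside the trace bound given above; the function names and the 2 x 2 example system are illustrative and are not taken from the NNP™ implementation.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def hopfield_solve(A, b, iterations=500):
    """Iterate x(i+1) = W x(i) + u with W = I - alpha*A^T A and u = alpha*A^T b."""
    At = transpose(A)                      # real case: A^H is simply A^T
    AtA = matmul(At, A)
    n = len(AtA)
    trace = sum(AtA[k][k] for k in range(n))
    alpha = 1.0 / trace                    # assumption: any 0 < alpha < 2/trace would do
    W = [[(1.0 if r == c else 0.0) - alpha * AtA[r][c] for c in range(n)]
         for r in range(n)]
    u = [alpha * v for v in matvec(At, b)]
    x = [0.0] * n                          # neuron states start at zero
    for _ in range(iterations):
        x = [wx + ui for wx, ui in zip(matvec(W, x), u)]
    return x

# Illustrative system: 2x + y = 3 and x + 3y = 5
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
x = hopfield_solve(A, b)
```

At the fixed point, x = Wx + u reduces to the normal equations (A^T A)x = A^T b, which is why the iteration settles on the least-squares solution.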

Figure 3. The robot manipulator model used for analysis of the 
inverse kinematics problem. 

Figure 4. Desired and actual end-effector trajectory in Cartesian 
space (1000 points). The error of mode 2 vanished when the 
trajectory consisted of 1500 points. 

The inverse kinematic problem is solved by forming a 
difference equation for $\Delta\theta(t)$ at a given discrete time along the 
$y(t)$ trajectory. The required solution is then used to move the 
joints to the desired position until the next time increment. 
Typically, the solution is obtained in a few hundred iterations 
of the Hopfield network on an NNP, which is much faster than 
real time for a robot joint. 


$$ \Delta\theta(i+1) = W\,\Delta\theta(i) + u \tag{10} $$ 

where 

u is a real or complex n-vector of inputs held constant at time t, and 

$\Delta\theta(i)$ is the ith iteration of the n-dimensional vector of neuron states for time t. 

The first step is to determine the values for the weight 
matrix and the input, u, for the Hopfield neural network. From 
the previous explanations it is fairly straightforward to see that 
the weight matrix and input to the Hopfield network for the 
linearized method of computing the desired joint positions are 
given by 

$$ W = I - \alpha J^{H} J \tag{11} $$ 

$$ u = \alpha J^{H} b \tag{12} $$ 

$$ 0 < \alpha < \frac{2}{\operatorname{trace}(J^{H}J)} \tag{13} $$ 

which yields the “best” solution for $\Delta\theta$ for each time step in 
a least squares sense. Once we find the solution for $\Delta\theta$, we 
issue incremental changes to the joints which maintain the 
desired trajectory. 

The hybrid feedforward-Hopfield neural network is capa- 
ble of solving the inverse kinematics problem in real-time 
when implemented on the AAC Neural Network Processor 
(NNP™). The Hopfield neural network was shown to con- 
verge to an optimal solution in a least squares sense and is 
applicable to a variety of optimization problems. 
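
The per-time-step procedure can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' code: it uses a hypothetical planar two-link arm with unit link lengths (the article's arm has three joints), takes b to be the remaining end-effector error at each waypoint, and runs the Hopfield iteration of equation (10) in ordinary software rather than on the NNP™.

```python
import math

L1, L2 = 1.0, 1.0  # hypothetical link lengths

def forward(theta):
    """Forward kinematics y = f(theta) for the planar two-link arm."""
    t1, t2 = theta
    return [L1 * math.cos(t1) + L2 * math.cos(t1 + t2),
            L1 * math.sin(t1) + L2 * math.sin(t1 + t2)]

def jacobian(theta):
    t1, t2 = theta
    s1, c1 = math.sin(t1), math.cos(t1)
    s12, c12 = math.sin(t1 + t2), math.cos(t1 + t2)
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]

def hopfield_delta_theta(J, b, iterations=300):
    """Least-squares solution of J * dtheta = b via the iteration in equation (10)."""
    rows, n = len(J), len(J[0])
    JtJ = [[sum(J[k][r] * J[k][c] for k in range(rows)) for c in range(n)]
           for r in range(n)]
    Jtb = [sum(J[k][r] * b[k] for k in range(rows)) for r in range(n)]
    alpha = 1.0 / sum(JtJ[k][k] for k in range(n))  # inside the bound of equation (13)
    dtheta = [0.0] * n
    for _ in range(iterations):
        # dtheta(i+1) = (I - alpha J^T J) dtheta(i) + alpha J^T b
        JtJ_d = [sum(JtJ[r][c] * dtheta[c] for c in range(n)) for r in range(n)]
        dtheta = [dtheta[r] - alpha * JtJ_d[r] + alpha * Jtb[r] for r in range(n)]
    return dtheta

def track(theta0, waypoints):
    """Issue incremental joint changes so the end effector follows the waypoints."""
    theta = list(theta0)
    for target in waypoints:
        pos = forward(theta)
        b = [target[0] - pos[0], target[1] - pos[1]]
        dtheta = hopfield_delta_theta(jacobian(theta), b)
        theta = [t + d for t, d in zip(theta, dtheta)]
    return theta

start = [0.3, 1.2]                       # illustrative starting joint angles (radians)
p0 = forward(start)
goal = [p0[0] - 0.3, p0[1] + 0.2]        # straight-line move in Cartesian space
steps = 200                              # illustrative number of trajectory points
waypoints = [[p0[0] + (goal[0] - p0[0]) * k / steps,
              p0[1] + (goal[1] - p0[1]) * k / steps] for k in range(1, steps + 1)]
final_pos = forward(track(start, waypoints))
err = math.hypot(final_pos[0] - goal[0], final_pos[1] - goal[1])
```

Because b is recomputed from the actual end-effector position at every waypoint, linearization errors are corrected at the next step instead of accumulating, and finer waypoint spacing gives a smaller residual error.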





Figure 5. 


Joint angle trajectories computed by the hybrid feedfor- 
ward/linear Hopfield network. 
This capability was demonstrated by simulating the motion of a 
three-joint robot arm, depicted in Figure 3, on a Silicon Graphics 
4D380VGX superminicomputer. The arm was given a straight line 
trajectory in Cartesian space consisting of 1000 points and 1500 
points. The trajectory for the 1000 point case was identical to 
the desired trajectory with one slight exception in one of the 
dimensions, as depicted in Figure 4. The 1500 point case was exact in all 
three dimensions. The joint angles generated by the hybrid 
neural network are shown in Figure 5. 

As a result of the success in the inverse kinematics project, 
the concepts were applied to flight controls for LoFLYTE. 


Neurocontrol for the Low 
Observable Flight Test 
Experiment (LoFLYTE) Aircraft 


The Low-Observable Flight Test Experiment (LoFLYTE) 
Advanced Technology Testbed Aircraft is being built to 
demonstrate neural network technologies in a real world aircraft 
using some of the Navy-funded SBIR research at AAC. 
LoFLYTE is laying the foundation for developing next 
generation aircraft. The LoFLYTE hypersonic aircraft, shown in 
Figure 6, is a research test vehicle to investigate performance 
of a waverider aircraft shape at hypersonic velocities (regime 
of Mach 5). A waverider is an aircraft which rests on its own 
shock wave as it flies. Waveriders such as LoFLYTE would 
typically utilize SCRAMJET engines. 

Hypersonic aircraft, such as LoFLYTE, could be used in 
a number of applications. First, they could be configured as an 
unmanned surveillance platform. Due to their high speed, they 
will be able to make passes over areas with minimal concerns 
about being shot down. This will greatly increase our surveil- 
lance capabilities during hostilities. Second, a hypersonic 
cruise missile could be configured with a capability to reach 
enemy targets much faster than conventional cruise missiles. 
The need for rapid attacks is supported by A. Fields Richardson 
(Capt. USN, Ret.), who, during Desert Storm, served as Prin- 
cipal Navy Liaison to the Joint Forces Air Component Com- 
mander and head of the Navy Strike Cell. According to Mr. 
Richardson, “The ability to strike quickly and with great 
precision is critical to tactical and strategic success. To launch 
a cruise missile or an aircraft at a target 1,000 miles away and 
receive a Battle Damage Indication (BDI) report in 20 minutes 
will provide dramatic tactical advantage, enabling an irresist- 
ible build-up of momentum for the force possessing that 
capability.” 

Finally, hypersonic vehicles could be configured as a 
manned aircraft to be launched from the decks of aircraft 
carriers. 

The LoFLYTE project, funded from an Air Force Phase II 
SBIR Contract, is the focal point and primary demonstration 
vehicle for several Phase I and Phase II hypersonic-related 
SBIR contracts funded by the Air Force and by NASA. 

The primary focus of AAC’s ONR-sponsored research, as 
it relates to this project, has been to apply lessons learned from 
biological neural systems to accurate real-time motor control. 
This research has led to a number of technologies applicable 
to the real-time control of complex systems, ranging from 
robotic manipulators to helicopters to hypersonic aircraft. 

Real-time control of such complex systems has required 
not only new, rapidly adaptive neural network algorithms, but 
also hardware which can carry out the necessary computations 
with great speed. The ONR-sponsored work has led directly 
to development of the AAC Neural Network Processor 
(NNP™) discussed in the next section. This processor makes 
possible the extensive and rapid computations necessary to 
control a hypersonic aircraft that will fly at Mach 5. In addition 
to the NNP™, AAC has developed, partially under ONR 
sponsorship, a full Toolbox of neural networks and learning 
methods. Several Toolbox neural networks, including the 
Adaptive Critic [9], have been successfully applied to the 
control of complex systems at AAC. The Toolbox is central to 
the development of the control algorithms being used in 
LoFLYTE. The motor control and motor-mapping algorithms 
developed for ONR are other examples of Navy technology 
which have formed a basis for portions of the LoFLYTE 
aircraft control design. 

AAC has developed neural network control algorithms 
inspired by drive-reinforcement theories of animal learning 
and adaptivity. We have applied these methods to the control 
of electrical motors in robotics tasks. By building upon these 
algorithms, we have created an adaptive actuator controller for 
LoFLYTE which adapts to changing loads on LoFLYTE’s 
tiperons and rudders. We have built upon insights into how the 
human brain maps control objectives to desired joint motions 
to develop new neural network-based methods for mapping 
flight control objectives to actuator commands. 

Figure 6. The AAC LoFLYTE waverider aircraft. 

Figure 7. The AAC Neural Network Processor (NNP™). 

Figure 8. Block diagram and interprocessor bus architecture for the 
AAC NNP™. 


Multiple Instruction, Multiple Data 
(MIMD) Neural Network 
Processor 


As an ongoing part of the research conducted under 
sponsorship by the Naval Research Laboratory and by Naval 
Command, Control, and Ocean Surveillance Center, RDT&E 
Division, Accurate Automation has developed a digital Neural 
Network Processor (NNP™) which is capable of true parallel 
processing. 

The underlying philosophy in the design of the sparse 
Multiple Instruction Multiple Data (MIMD) NNP™ has been 
to achieve maximum computational efficiency in both a single 
processor and multiprocessor environment by optimizing the 
design to compute neuron values very efficiently [10, 11]. This 
is in stark contrast to previously proposed neural network 
processors which are typically based on classical Single In- 
struction Multiple Data (SIMD) matrix/vector multiplication 
architectures. Our design fully exploits the intrinsic sparseness 
of neural network topologies. Moreover, by using a MIMD 
parallel processing architecture, one can update multiple neu- 
rons in parallel with efficiency approaching 100% as the size 
of the neural network increases. 

To achieve the desired efficiency we have adopted a 
design which: 1) uses an instruction set optimized for neural 
network processing, allowing one to compute a neuron 
activation without arranging the weight matrix into linear arrays 
and/or inserting artificial zero-weighted connections; 2) uses a 
MIMD parallel processing architecture to permit neurons with 
totally different input topologies to be updated simultaneously 
without loss of efficiency; and 3) uses dual neuron memories 
to virtually eliminate memory contention and maintain 
absolute memory coherence. 

The NNP™ (see Figure 7) is capable of implementing 8K 
total neurons with 32K interconnections per processor. A fully 
configured PC version of the NNP™ is capable of intercon- 
necting 8 modules for a total of 8K neurons and 256K inter- 
connections. In addition to the PC version, the VME version 
is mated with two TI TMS320C40 DSPs which allow for 
real-time data manipulation and processing. The VME card is 
capable of stacking three modules with additional modules 
requiring a separate VME card due to power limitations. The 
NNP™ design is based upon a linked list concept which 
allows any neuron in the network to be connected with any 
other neuron. Thus, any of the recurrent networks, such as the 
Hopfield network, or any feedforward networks such as the 
multilayer perceptron, can be easily implemented using this 
processor. This hardware capability is vitally important in 
control applications because it gives us a neural computation 
engine which is easily adapted to changing control inputs and 
boundary conditions. 

A block diagram of the NNP™ architecture is given in 
Figure 8. The program instructions and associated weights, as 
needed, are stored in memory. For a given operation, the 
instruction is decoded and the necessary value is fed from neuron 
memory to the multiplier input unit along with the weight 
value. The two values are fed to the Multiply ACcumulator 
(MAC), and the result is passed as an address to the function 
memory. The address is then used to fetch the appropriate 
value from the transfer function. This result is then passed 
through a First-In-First-Out (FIFO) unit and stored in the buffer 
memory for each of the modules via the interprocessor bus. 
When an “interchange neuron and buffer memory” (inbm) 
instruction is encountered, each processor finishes processing 
its code segment before the buffer and neuron memories are 
interchanged. Once the memories are interchanged, processing 
continues until another inbm or stop instruction is encoun- 
tered. The processor also has the ability to multiply two neuron 
values together. This is done by passing the previous neuron 
value encountered to the MAC as an input and then passing 
the current neuron value as an input to the MAC followed by 
a multiplication operation. The result is passed through the 
transfer function and stored in buffer memory as explained 
previously. 

The fully pipelined implementation of the neural network 
architecture delivers nearly one instruction per cycle which 
allows the neural network board to execute nearly 35 million 
instructions per second. Using the standard definition of a 
connection, a byte-wide multiply-accumulation, the NNP™ 
yields over 140 million connections/sec for a single board. As 
additional modules are added, the speed increases linearly with 
over one billion connections per second possible with a full 
complement of eight processors. 
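
As a quick consistency check on the figures quoted above (an inference from the stated numbers only, not a statement about the internal design of the processor), the single-board rates imply about four connections per instruction, and eight processors scaling linearly do exceed one billion connections per second:

$$ \frac{140 \times 10^{6}\ \text{connections/s}}{35 \times 10^{6}\ \text{instructions/s}} \approx 4\ \text{connections per instruction}, \qquad 8 \times 140 \times 10^{6} = 1.12 \times 10^{9}\ \text{connections/s} $$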

One of the most vital aspects of parallel computing is the 
ability of the architecture to maintain memory coherence. The 
AAC NNP™ accomplishes memory coherence through the 
use of two separate memories, combined with a special 
interchange instruction. In most neural network implementations, 
the results from one layer are multiplied by a series of weights, 
summed, and then passed through a transfer function, typically 
a sigmoid function, before being used as inputs to the 
neurons on the next layer. 

In the NNP™, the inputs are read from neuron memory, 
with the outputs from the transfer functions stored in buffer 
memory. When all of the neurons on a layer have been 
processed, an interchange buffer and neuron memory instruction is 
issued, which makes the new data available for use by the next 
layer of neurons. 
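
The two-memory scheme can be mimicked in software to see why it keeps a layer's inputs coherent while its outputs are being written. The sketch below is an illustration only: the class and method names are hypothetical, a computed sigmoid stands in for the function-memory lookup, and copying values forward after the interchange is an assumption made so that neurons not rewritten keep their state.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

class DoubleBufferedNet:
    """Neurons read from neuron memory and write results to buffer memory."""

    def __init__(self, n_neurons):
        self.neuron_mem = [0.0] * n_neurons
        self.buffer_mem = [0.0] * n_neurons

    def load(self, index, value):
        # Write an external input to both memories so it survives an interchange.
        self.neuron_mem[index] = value
        self.buffer_mem[index] = value

    def update(self, target, fan_in):
        """fan_in is a sparse list of (source_index, weight) pairs."""
        acc = sum(w * self.neuron_mem[src] for src, w in fan_in)  # multiply-accumulate
        self.buffer_mem[target] = sigmoid(acc)                    # transfer function

    def interchange(self):
        """Software stand-in for the interchange step: new values become readable."""
        self.neuron_mem, self.buffer_mem = self.buffer_mem, self.neuron_mem
        self.buffer_mem = list(self.neuron_mem)  # assumption: carry state forward

net = DoubleBufferedNet(5)
net.load(0, 1.0)
net.load(1, -1.0)
net.update(2, [(0, 0.5), (1, 0.25)])     # hidden neuron 2
net.update(3, [(0, -1.0)])               # hidden neuron 3, different fan-in
hidden_before_interchange = net.neuron_mem[2]   # still the old value
net.interchange()
net.update(4, [(2, 2.0), (3, -2.0)])     # output neuron reads the new hidden values
net.interchange()
output = net.neuron_mem[4]
```

Until the interchange is issued, every neuron in the layer sees the same old values in neuron memory, regardless of the order in which results are written.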


A particular NNP™, in a worst case scenario, takes four 
clock cycles to access the bus and write a neuron value to the 
buffer memory. On average, each processor needs to 
access the bus every (f+1) clock cycles, where f is the average 
fan-in to a given neuron in the network. Thus, bus contention is 
avoided as long as the number of processors p satisfies p < (f+1)/4; 
when p > (f+1)/4, bus contention occurs. The 
NNP™ throughput as additional processors are added is 
shown in Figure 9. As can be seen in the figure, the throughput 
increases linearly as processors are added until p > (f+1)/4. 
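
The contention bound is easy to evaluate for a given network. The following sketch computes the largest processor count that satisfies p < (f+1)/4. The saturation model in `relative_throughput` is an assumption added for illustration (throughput capped at the bus limit once contention starts); the text states only that growth is linear up to that point.

```python
import math

BUS_WRITE_CYCLES = 4  # worst-case cycles to write one neuron value, from the text

def max_processors_without_contention(fan_in):
    """Largest integer p that satisfies p < (fan_in + 1) / 4."""
    limit = (fan_in + 1) / BUS_WRITE_CYCLES
    return math.ceil(limit) - 1

def relative_throughput(processors, fan_in):
    """Throughput in single-processor units; the cap is an assumed saturation model."""
    limit = (fan_in + 1) / BUS_WRITE_CYCLES
    return min(float(processors), limit)

p_max_sparse = max_processors_without_contention(10)   # small average fan-in
p_max_dense = max_processors_without_contention(31)    # larger average fan-in
```

Networks with a larger average fan-in spend more cycles computing per bus write, so they tolerate more processors before the bus becomes the bottleneck.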

The inherent speed of the NNP™ in processing sparse 
matrices makes it ideal for computing neural network struc- 
tures such as the cooperative-competitive neural network, 
described in the next section. 


Sensor Fusion 


One of the primary objectives of our work has been to 
develop new sensor data fusion capabilities for the Navy. 
Sensor fusion is a Militarily Critical Technology [12-14]. 
Neural networks present a method by which sensor fusion 
systems can learn from experience, instead of always requiring 
explicitly the a priori probabilities that are currently needed 
for existing (e.g. Bayesian) formalisms. Further, neural net- 
works have the potential to give a system adaptability to 
changing environments and conditions. Finally, neural net- 
works can be implemented in exceptionally fast parallel-proc- 
essing hardware (such as the AAC NNP™), thus overcoming 
the huge computational burden associated with real-time sen- 
sor fusion. Such a neural network-based capability can play a 
crucial role in enabling the Navy to build integrated systems, 
linking together systems which are currently “stovepiped.” 

One of the big challenges in sensor data fusion is assign- 
ing each target to the right track. Gates can be used to find out 
what targets are in the vicinity of the expected target positions 
from different tracks. There is still a problem of conflict 
resolution, when different targets could be assigned to two or 
more tracks. We have addressed this challenge by developing 
the novel COoPerative-COMpetitive (COPCOM) neural net- 
work. The neural network determines which items in a given 
Set A have closest similarity to items in another Set B. This 
network operates by making iterative target-to-track assign- 
ments, so that: 

• Targets are assigned where there is maximal closeness 
between a target and the track across a set of matching 
metrics, and simultaneously 

• Targets are assigned where there is minimal conflict 
between a prospective match and other possibly 
competing matches. 

In this, it offers a more robust approach to assignment than 
simple non-optimal assignment methods, and unlike the 
existing optimal assignment algorithms, it can be implemented in 
parallel-processing hardware (e.g. the AAC NNP™) for real 
time solutions to the target-to-track assignment problem [15]. 

The advantage of using the COPCOM network for assign- 
ment tasks is that it makes its first (highest strength) assign- 
ments to those matches which have the overall highest values 
in favor of making the assignment (cooperation across multi- 
ple dimensions of similarity), and which also have the least 
competition with other possible assignments. By making the 
least ambiguous assignments first, the complexity of the over- 
all problem is reduced. This can make it easier to determine 
future assignments. 

The multilayer COPCOM neural network was initially 
developed to deal with one of the major challenges of image 
understanding - that of recognizing objects when they are 
composed of many different parts, are partially obscured, 
and/or have specular reflections. To meet this challenge, an 
early, prototype version of the COPCOM neural network 
identifies those portions in a segmented image which are most 
related to each other. This early version of the COPCOM 
neural network has been applied to images where high 
specularities, dark contrasting shadows, and occluding objects 
present substantial challenges for image understanding [16,17]. 
The COPCOM neural network has since been redefined to 
create associations between objects in two different sets. This 
can be used for matching objects, or tracking the evolution of 
a multipart system over time. 

Inspiration for the COPCOM design came from the per- 
ceptual psychology of vision, which suggested that many 
factors, e.g. similarity of intensity, boundary line continuation, 
and proximity, all played a role in perceptual organization 
[18,19]. 

Figure 10. The COoPerative-COMpetitive (COPCOM) neural network 
architecture. 

The COPCOM neural network plays a vital role in 
target-to-track assignment in the AAC Sensor Fusion Tracking 
System. Preliminary coarse and fine gating produce a string of 
potential new-target matches for every Master Target Track 
(MTT). The COPCOM method is then used to resolve 
matching assignments. A matrix of possible matches to possible 
tracks is used to partition out the problem, so that only the 
subset of new detections which can potentially match a given 
subset of targets is considered at a time. Unique target-to-track 
assignments are made before the COPCOM network is used. 
A search of the COPCOM output nodes for those whose 
activations pass threshold yields an ordered list of 
non-competing assignments. The pairwise combinations with the 
strongest activations are listed first. The Tracker uses this 
output to make assignments, prune the remaining possible 
target-to-track assignment possibilities, and rerun COPCOM 
as often as necessary to get a complete set of assignments. 
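
The threshold, order, assign, prune, and rerun loop described above can be sketched as follows. This is an illustration under assumptions, not the AAC Tracker: `toy_score` is a hypothetical stand-in for the COPCOM output activations (a simple exponential decay of one-dimensional distance), and all names and values are invented for the example.

```python
import math

def noncompeting_assignments(activations, threshold):
    """Strongest-first list of (target, track, strength) with no shared target or track."""
    candidates = sorted(
        ((s, pair) for pair, s in activations.items() if s > threshold),
        key=lambda item: item[0], reverse=True)
    used_targets, used_tracks, accepted = set(), set(), []
    for strength, (target, track) in candidates:
        if target in used_targets or track in used_tracks:
            continue  # competes with a stronger assignment already accepted
        accepted.append((target, track, strength))
        used_targets.add(target)
        used_tracks.add(track)
    return accepted

def assign_all(score_fn, targets, tracks, threshold):
    """Score the remaining pairs, accept, prune, and rerun until nothing passes threshold."""
    remaining_targets, remaining_tracks = set(targets), set(tracks)
    assignments = []
    while remaining_targets and remaining_tracks:
        activations = score_fn(sorted(remaining_targets), sorted(remaining_tracks))
        accepted = noncompeting_assignments(activations, threshold)
        if not accepted:
            break
        for target, track, _ in accepted:
            assignments.append((target, track))
            remaining_targets.discard(target)
            remaining_tracks.discard(track)
    return assignments

# Hypothetical 1-D positions for two new detections and two existing tracks
target_pos = {"t1": 0.0, "t2": 1.0}
track_pos = {"A": 0.1, "B": 1.2}

def toy_score(targets, tracks):
    # Stand-in for the COPCOM output layer: closer pairs get higher activation.
    return {(t, k): math.exp(-abs(target_pos[t] - track_pos[k]))
            for t in targets for k in tracks}

result = assign_all(toy_score, target_pos, track_pos, threshold=0.5)
```

Accepting the strongest, least ambiguous pairs first removes them from later passes, which shrinks the remaining problem each time the scoring is rerun.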

The cooperative-competitive method has been demon- 
strated to be effective for target-to-track association, even in 
dense target environments. Some of the scenarios for which 
COPCOM effectiveness has been shown include crossing 
targets, splitting targets, and dense targets (e.g. close groups 
of 4, within overall interacting scenarios of 16 proximal tar- 
gets) [20-22]. 

The items to be matched, i.e., the members of Set A and 
Set B (e.g., new detections and tracks), must have the same 
dimension or vector length, denoted N, and the elements in 
these two vectors should have the same meaning. (Of course, 
if matching is being done between elements of the same set, 
then this requirement is automatically satisfied; Set A = Set 
B.) For example, the items of Set A and Set B could each be 
the x and y positions of objects in Euclidean 2-space. We 
denote the nth dimension of the ith item of Set A as $a_n^i$, and 
the nth dimension of the jth item of Set B as $b_n^j$. Let there be 
a total of I items, i = 1..I, in Set A, and J items, j = 1..J, in Set 
B. The COPCOM network operates on functions of the 
distance between the vector components for each possible 
pairwise match of members of Sets A and B, over each of the N 
dimensions describing each set member. 

The COPCOM network conceptually consists of four or 
more layers. A basic COPCOM network is shown in Figure 
10. The nodes in Layer 1 represent the individual items 
themselves. The nodes in Layers 2 and above represent the strength 
of relationships between pairs of items taken from Set A and 
Set B, not the individual items themselves. This means that 
if Set A has I items, and Set B has J, there are I*J nodes in each 
of N subnets at the second and succeeding layers, to 
accommodate that number of pairwise relationships. 

COPCOM works by assessing relative similarities be- 
tween items across multiple dimensions. In Layers 2, 3, and 
subsequent intermediate layers, there is a separate subnet for 
each dimension which will contribute to the overall 
assignment decision. Thus, if the items in Sets A and B can each be 
described by a 2-D vector (e.g., x and y values), then Layers 
2, 3, ... will each have two subnets, one for each dimension. We 
could call these the X subnet and the Y subnet. This paradigm 
has been instantiated with up to 5 dimensions. The larger the 
number of dimensions, the more effective the assignment 
process is, because more different types of information are 
used in making assignments. 

In the final layer, there is only one subnet. This subnet 
combines the results of the multi-subnet processes in the lower 
layers. The details of the COPCOM neural network are fully 
specified in [20] and results of this network applied to image 
processing are described in [16,17]. 

The inputs used in the first subnet on the first layer are all 
the vector elements. The inputs to the nth subnet on the first 
layer are element values (e.g. position) in the nth dimension. 

The second layer, the D or difference layer, of the COP- 
COM network computes a function of the pairwise differences 
for each of the N dimensions of the items of Sets A and B being 
compared. There are N distinct subnets in the second layer, one 
for each dimension. For each subnet in the second layer, the 
value of a node $d_n^{ij}$ is computed as 

$$ d_n^{ij} = f\left(a_n^i - b_n^j\right) \tag{14} $$ 


where f may be a Gaussian function or an exponential decay 
function applied to the absolute value of the differences (as 
is done in the current implementation of COPCOM), or any 
other monotonically decreasing even function. The result of 
using this function is that the actual difference between two 
vector elements is scaled to (0,1]. Thus, when the distance 
between two items in any dimension is 0, the strength of the 
node in the subnet for that dimension is 1. As the distance 
between two items increases, the strength value put into the 
Layer 2 subnet node decreases towards 0. This pairwise 
difference metric is computed separately for each dimension. 
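
The difference layer described above can be sketched in a few lines. This is a minimal illustration, not the COPCOM implementation: the function name `difference_layer`, the `scale` parameter, and the sample data are assumptions introduced here, and the exponential decay of the absolute difference is only one of the admissible choices of f named in the text.

```python
import numpy as np

def difference_layer(a, b, scale=1.0):
    """Layer-2 (D layer) activations for two sets of N-dimensional items.

    a: array of shape (I, N), the items of Set A
    b: array of shape (J, N), the items of Set B
    Returns d of shape (N, I, J): one I-by-J subnet per dimension, with
    d[n, i, j] = exp(-|a[i, n] - b[j, n]| / scale).
    `scale` is a hypothetical decay constant, not taken from the source.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]   # pairwise differences, shape (I, J, N)
    d = np.exp(-np.abs(diff) / scale)      # even, monotonically decreasing, in (0, 1]
    return np.moveaxis(d, -1, 0)           # one subnet per dimension: (N, I, J)

# Two items in Set A and three in Set B, each described by (x, y).
set_a = [[0.0, 0.0], [3.0, 1.0]]
set_b = [[0.0, 2.0], [3.0, 1.0], [5.0, 5.0]]
d = difference_layer(set_a, set_b)
```

Here `d[0]` plays the role of the X subnet and `d[1]` the Y subnet. A node equals 1 where the two items coincide in that dimension and decays towards 0 as they move apart.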

There are the same number of subnets and nodes in the 
third and succeeding layers as there are in the second. The only
exception is the final layer, which is configured as a single 
subnet, with the same number of nodes as the subnets in the 
previous layers. Each node in the third and succeeding layers, 
the C or cooperative-competitive layers, receives both direct 
inputs from the corresponding node in the previous layer and 
either cooperative or competitive inputs from other nodes in 
the previous layer. 

For the current implementation of the COPCOM network,
the user selects how many cycles of cooperation and/or com-
petition are to be used, and in what order they are to be
employed. This is conceptually equivalent to selecting the
number of cooperative-competitive layers which will be used
in the network. The term cycle refers to a single application of
either a cooperative or a competitive process. The connections
used for cooperative (excitatory) and competitive (inhibitory)
inputs are different. If a cooperative cycle is chosen, the nodes
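$c^n_{ij}$ in the corresponding cooperative-competitive layer receive
their activation as


$$c^n_{ij} = d^n_{ij} + E_q \left( \sum_{\substack{i'=1 \ldots I \\ i' \neq i}} d^n_{i'j} + \sum_{\substack{j'=1 \ldots J \\ j' \neq j}} d^n_{ij'} \right) \tag{15}$$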


If a competitive cycle is chosen, the nodes $c^n_{ij}$ receive their
activation as

$$c^n_{ij} = d^n_{ij} - I_q \sum_{\substack{n'=1 \ldots N \\ n' \neq n}} d^{n'}_{ij} \tag{16}$$


The connection strength parameters $E_q$ and $I_q$ are chosen
by the user, where q corresponds to a given cycle. The version
of the COPCOM network implemented in the AAC Tracker
allows the user to select as many cooperative and competitive
cycles as desired, in any order, and with different weights
assigned to each cycle.

Figure 10 illustrates some of the connections between a
$d^n_{ij}$ node and a corresponding $c^n_{ij}$ node. Both coopera-
tive and competitive links are shown in this figure. In actual
operation, only one of the two, cooperative or competitive connec-
tions, would be used for a d-to-c transition. The user could
also specify additional cycles, resulting in additional layers,
each dedicated to either the cooperative or competitive
process, and typically alternating.

The dynamics consist of the feedforward flow of activa- 
tion through the connections described in the preceding para- 
graphs. At the next-to-topmost layer, the nodes send their 
activations to the single subnet in the topmost layer, which 
sums each of the inputs and performs thresholding. The node
activations here represent the winners in the assignment proc- 
ess. Connection strengths are typically set before the network 
is used, and are not adapted during network use. 
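
The summing and thresholding step in the topmost layer can be sketched as follows. This is an illustrative sketch only: the function name `final_layer`, the threshold value, and the sample activations are assumptions, and the cooperative-competitive cycles that would normally precede this step are omitted.

```python
import numpy as np

def final_layer(c, threshold):
    """Sum the per-dimension subnets and threshold the result.

    c: array of shape (N, I, J), activations of the last multi-subnet layer
    threshold: hypothetical cut-off; the source does not give a value
    Returns (total, winners): the summed I-by-J activations and a boolean
    mask marking pairings whose summed activation exceeds `threshold`.
    """
    c = np.asarray(c, dtype=float)
    total = c.sum(axis=0)          # single subnet combining all dimensions
    return total, total > threshold

# Made-up activations for two dimensions (X and Y) and 2-by-2 pairings.
c = np.array([[[1.0, 0.1], [0.2, 0.9]],    # X subnet
              [[0.8, 0.3], [0.1, 1.0]]])   # Y subnet
total, winners = final_layer(c, threshold=1.5)
```

With these numbers the diagonal pairings are the only ones that survive thresholding, which corresponds to assigning item 0 of Set A to item 0 of Set B and item 1 to item 1.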


Biologically-Based 
Sensor Fusion Circuit 


The objective of this work is to create and instantiate a 
new, biologically-inspired approach to sensor fusion that will 
have exceptional performance capabilities for the Navy. Using 
novel, special-purpose circuits, we are creating a parallel proc- 
essing sensor fusion capability that will emulate the unique 
multisensor fusion abilities that have recently been discovered 
to exist in the brain. These include the ability to: 

• Correlate detections over a range of spatial registry,
  thus making possible good sensor fusion even with
  inexact sensor registry,

• Correlate detections observed at slightly different
  times, thus enabling combination of sensor data from
  sensors with different processing speeds or having
  different times of target observation, and

• Combine weak detections of targets made by different
  sensors, thus yielding higher probability of detection
  and lower false alarm rates.

In the course of this work, we are producing a unique 
sensor fusion circuit based on biological sensor fusion con- 
cepts. This new robust and generic sensor fusion capability 
will have widespread application throughout the Navy, the rest 
of DoD, and the private sector. There is currently no similar
sensor fusion chip available. 

To accomplish this goal, we have formed, under this 
ONR-sponsored Phase II Small Business Technology Transfer 
Research (STTR) project, the first teaming relationship be- 
tween biologists doing sensor fusion research and neural net- 
work specialists in sensor fusion who design both hardware 
and software for the Navy. 

We began this work with the premise that biological 
systems have already solved the sensor fusion problem [23- 
28]. What is specifically useful is that the biological approach 
to sensor fusion has already resolved the key issues which 
press technology development of sensor fusion for DoD. For 
example, biological sensor fusion is designed to use sensors 
which are in “loose” registration with each other. This greatly 
mitigates the need for costly and difficult-to-maintain “tight” 
sensor coregistration or conversion of inputs into a precisely 
correlated common reference frame. Biological sensor fusion 
correlates target observations which appear at slight offsets in 
both time and space - even though they are of the same target. 
The biological approach fuses information from different sen- 
sors to confirm detection of weak targets, even under condi- 
tions of noise. Finally, the biological sensor fusion method has 
the built-in capacity to do context-dependent orienting and 
alerting; it can direct attention even to weak but significant 
stimuli, and it can downgrade response to strong but insignifi- 
cant stimuli. All of these capabilities are highly desirable, if 
not necessary, in sensor fusion systems for DoD. 

Biological sensor fusion takes advantage of parallel proc- 
essing and extensive local-neighborhood connectivity. Al- 
though software emulations of very simple, scaled-down 
versions of the biological system can conceivably be hosted in 
existing serial-processing workstations, the necessary system 
size precludes advanced sensor fusion system designs. Our 
calculations indicate that essentially all the memory and com- 
putational processes in even a high-end workstation will be 
used by a full-scale software emulation. This would not allow 
for design improvements. To overcome this limitation, we 
have designed specialized biologically-inspired sensor fusion 
hardware leading to greater down-the-road flexibility and the 
potential for higher-level performance. 

One of the primary characteristics of biological sensor
fusion is that it creates (at the cost of great investment of true
“neural” wetware) a topographic representation of the sensor-
observed surrounding space. This topographic representation
is essential to alerting and orienting the animal, and to fusing
inputs that are loosely in register with each other. Recent
research now indicates that extensive connections from the
higher portions of the brain (the striate cortex through the basal
ganglia) add the influence of high-level knowledge and con-
text to processing targets in the superior colliculus [29]. Our
design preserves these important relationships. Under this 
project, we are creating a special sensor fusion circuit array 
that mimics the sensor fusion processing of the superior colli- 
culus. We envision that high-level knowledge stored in pro- 
grams hosted on a workstation can activate portions of this 
array, thus assisting target detections. 

One of the greatest benefits of this new sensor fusion 
technology for the Navy is its ability to perform context-de- 
pendent target detection. In software emulations of the pro- 
posed sensor fusion circuit, we are demonstrating two 
significant types of embedded context-dependent responses:

• If a target is detected in a given area, look in the
  neighborhood of the target for other (possibly weaker)
  targets.

• If a target has been observed in the past, but is now lost,
  look in the broad neighborhood around its last sighting
  for a (possibly weaker) detection.

Both of these context-dependent behaviors are embedded
into the “wetware” of the biological sensor fusion system, and
also are designed into the software emulations and the hard-
ware that we are building. This capability comes about through
the “local gain control” which is exerted by the substantia nigra
pars reticulata (SNR) on the superior colliculus (SC), as shown
in Figure 11 [30].


Figure 11.

Summary diagram illustrating the presumptive relationships
between the “direct” and “indirect” corticotectal pathways.
Visual areas of the lateral suprasylvian (LS) cortex provide
direct excitatory connections to the superior colliculus (SC).
An indirect stimulus comes about from excitatory connections
from the LS to the striatum (ST). The excited ST neurons inhibit
the substantia nigra pars reticulata (SNR) neurons, which in
turn release the inhibition they have been exerting on (disin-
hibit) the SC neurons. The indirect route serves as a mecha-
nism for “gain controlling” sensitivity of portions of the SC.
[Figure taken from McHaffie et al., 1993] [30]

The context-dependent enhanced target detection capabil- 
ity will be very useful to the Navy. One example of how this 
capability can be employed is to consider the current status of 
target tracking in the Detection and Tracking Module (D&T) 
of the Combat Direction Center (CDC). When a tracked target 
makes a sharp maneuver, the current trackers tend to lag the 
radar and Identification Friend or Foe (IFF) returns. The de- 
tector/tracker operator could manually intervene to change the 
nature of the tracker to accommodate maneuvers, especially if 
the target is in a turn. Occasionally, the radar return may be 
weak due to the changing aspect of the target. Our novel sensor 
fusion system has the unique ability to preferentially extract 
weak detections in the neighborhood of a “lost” initial track. 
This means that tracks, even of maneuvering targets, will be 
maintained more readily. Further, if several targets are in the 
neighborhood of each other, they will be detected more readily 
if even only one of them gives a multisensor response (or 
strong single-sensor response). Using the software emulation, 
we will be able to identify parameters that will give the most 
desirable response for different operating criteria, thus enhanc- 
ing the Navy’s ability to wage war under diversified field 
conditions. 


Figure-of-Merit Determinations 


A figure-of-merit is a metric which estimates the effec- 
tiveness or quality of a process or system. One such figure-of- 
merit which is of great interest in this study is a figure-of-merit 
in target identification, or FOM_ID. Such a figure-of-merit can 
help operators know the degree of confidence in a target ID. 
This knowledge can increase effectiveness by reducing misi- 
dentifications and the associated friendly fire incidents. 

The FOM_ID is being constructed from two different
types of figures-of-merit; one for target identification informa- 
tion, and one in the confidence of uniquely assigning a given 
target report (including the associated ID information) to a 
given track. The latter is a unique consideration in developing 
ID confidence metrics. It will help greatly in alerting operators 
to possible “ID-rub-off” situations, which have been observed 
extensively in major exercises (e.g. the Joint Air Defense
Operations Joint Engagement Zone (JADO-JEZ) Near-Land
93 Exercise). ID rub-offs can be a major source of ID conflicts, 
leading to both friendly fire and penetration of Blue regions 
by hostile forces. 

Accurate Automation is conducting research under a re- 
cently-awarded Phase II SBIR, under which four useful fig- 
ures-of-merit will be constructed. These are figures-of-merit 
in: 

• Target Position Error (FOM_TPE) for an individual
  target, which estimates the error in the position esti-
  mate, whether it is constructed using single or multi-
  sensor data,

• Assignment Confidence (FOM_ASSIGN), which
  gives confidence in how uniquely a new target obser-
  vation can be assigned to a given target track (vis-a-vis
  other neighboring target tracks),

• Combined INFOrmation on target IDentity
  (FOM_ID_INFO), which represents the accumulation
  (over time) of different ID reports for a given target,
  and their confirmation / disconfirmation of a consistent
  ID, and

• Target ID (FOM_ID), which represents the combined
  effect of both target ID information and report assign-
  ment (uniqueness of ID assignment) for a given target.

The purpose of a “target position error” figure-of-merit 
(FOM_TPE) is to assess the error associated with the state 
position estimate. The FOM_TPE will be useful in determin-
ing the accuracy to which a target’s position is known.
This will be useful in many ways, e.g., assigning size and
duration of scan for scanning radars tracking maneuvering 
targets, and determining whether the inbound approach of an 
aircraft is within sensor tolerance for certain types of auto- 
mated landing, or whether the landing needs to be down- 
graded. It can be used to assess quality of tracking for weapons 
control. Further, it sets a technical basis for the next step, 
developing a figure-of-merit in target-to-track assignment. 

The purpose of the new target-to-track assignment figure- 
of-merit (FOM_ASSIGN) is to answer the question: To what 
extent can we be assured that the sensor measurements used 
to update a target track apply to that particular track and not to 
another? This question must be answered to have effective 
target ID. 

Miscorrelations, caused by sensor information attached to 
one target track “rubbing off” on a target that comes into the
neighborhood of the first, are a major cause of target mis-iden-
tifications. This can lead to friendly fire, to allowing an enemy
to enter secured airspace, and other undesirable events. 

The purpose of FOM_ID_INFO is to express the confi- 
dence of giving a target classification, both in terms of target 
type (e.g. fighter / bomber / etc.) and target nature (friend / foe 
/ neutral / unknown). The target classification will be derived 
from a combination of the different information which con-
firms / disconfirms target class / ID. It will also be time-vary-
ing, as different information becomes available, or as informa-
tion which is expected does or does not appear.

Our final step is to construct an overall figure-of-merit 
for target ID (FOM_ID). This FOM makes use of both 
FOM_ID_INFO and FOM_ASSIGN. This is the figure-of- 
merit that will be most directly and frequently useful to 
operators and tactical commanders in a large number of 
scenarios, ranging from pilots / RIOs who need to make fire 
/ no-fire decisions, to battlegroup commanders observing 
targets at a more theatre-level of engagement. 

The concept of a Figure-of-Merit is almost universally 
applicable to assessing system performance. Figures-of- 
merit have been developed and used for GPS receivers and 
for sensor performance evaluation. Simple FOMs are cur- 
rently in use for track quality and for target ID. The neural 
network figure-of-merit technology is directly applicable to 
the task of rate illumination [31]. This approach provides a 
non-parametric means of time-domain and/or frequency 
domain correlation as it is applied in Naval signal process- 
ing systems. 

A figure-of-merit formalism for target-to-track assign- 
ment was proposed as early as 1980 by [32]. He proposed 
a simple algorithmic formalism based on distance between 
the new target observation and the expectation, as well as 
covariances of past observations. Other formalisms are 
similarly algorithmic and reliant on simple functions of 
covariance matrices. AAC’s approach to figure-of-merit 
determination is to investigate the ability of a neural net- 
work to learn correlations between the inputs available to a 
system during operation (sensor observations and predic- 
tions), and information available only during training (the 
difference between state estimate and ground truth). We are 
conducting experiments to determine the extent to which 
these correlations can be embedded into the weight struc- 
ture of a neural network, giving a non-analytic model for 
performance expectations based on recent observations. 

One of the greatest potential uses of figure-of-merit 
technology is as inputs to intelligent agents. Intelligent 
agents are a powerful emerging technology which will be 
used to support individual and collective projects. One of 
the primary uses of intelligent agent technology for the 
Navy will be as advisors and/or assistants to battlefield 
commanders and other military personnel with time-critical 
missions. It is to be noted that intelligent agents constructed 
with neural networks could be tailored for each “warrior” 
to fully support their warfighting needs. Because of the 
unique ability of the neural net to learn, they can be adap- 
tively trained for each individual warrior’s tasks and needs. 
The goal of these intelligent agents will be to help form and 
update situation assessments; to rapidly identify those tar- 
gets and events which will have the greatest impact on their 
user’s mission and safety. Intelligent agents will use fig- 
ures-of-merit to assess the criticality of different targets, 
and to assess the impact of target interactions with the user,
each other, and their environment. By offloading some of the
situation assessment responsibility, and giving the user that
information which is most necessary for survival and mission 
completion, the agents will enable their operators to function 
effectively in situations which evolve very rapidly and which 
are confounded by large amounts of available data. 

We envision that the intelligent agents that will be built 
for the Year 2000 and beyond will be systems incorporating 
aspects of both symbolic computation (e.g., classic artificial 
intelligence (AI) methods such as expert systems, black- 
board systems, embedded domain knowledge, etc.) as well 
as the more recently evolved “soft intelligent computing” 
strategies (including neural networks, fuzzy logic, and ge-
netic algorithms). This combination of methods will be
necessary so that the intelligent agent can learn in real-time, 
adapt its rule base to accommodate newly learned rules, 
build a more complete user and domain model, and adapt to 
changes in operating conditions. 


Directions For Future Work 


Application of neural networks to modern warfare re- 
quires that researchers work closely with the warfighter and 
move promising concepts rapidly into fielded use [2]. In 
addition, researchers must keep in mind that the goal is to 
insert technology into in-service systems. To this end, Ac- 
curate Automation’s research efforts are directed towards 
technology transition via participation in Advanced Con- 
cept Initiatives (ACIs). These give the opportunity for rapid 
concept development, demonstration and evaluation, and 
integration into Navy warfighting systems. To assist in this 
mission, AAC conducts basic research in neural
networks with significant application potential across a
wide range of both DoD and civilian uses. For example,
AAC’s new “sensor fusion chip,” being developed under a 
new Phase II Small Business Technology Transfer Research 
(STTR) program, will enable precise multisensor localiza- 
tion of commercial ground vehicles as well as supporting 
DoD multisensor fusion needs. 

Under an NSWC Phase II SBIR Cruise Missile Visuali- 
zation program, we are developing means to display corre- 
lated information in the most useful manner possible. This 
will allow strike commanders and battlefield commanders 
to readily visualize the trajectories of proposed retargeting 
options. 

We are investigating methods for object-based intelli- 
gent systems development, incorporating both “soft intelli- 
gent computing” methods (e.g. neural networks and fuzzy 
logic) and symbolic knowledge representation. These meth- 
ods can be used to automatically supplement standard DoD 
digital databases with the most current information avail-
able. Intelligent agent technology, operating on these geospa-
tial vector databases, will not only identify discrepancies
between new information and old, but identify solutions 
to either update the plan or to request additional informa- 
tion. 

Our neural networks are being applied to UHF satellite 
communication (SATCOM) modems, in a manner that will 
allow greater data throughput and thus enhance communi- 
cations for the Navy’s FLTSAT in the 25 kHz channel. The 
modems’ performances in terms of bit-error-rate are greatly 
improved, allowing the higher throughput to occur. This 
initial modem development replaced the demodulator’s in- 
tegrate-and-dump filter with a neural network matched fil- 
ter that is functionally equivalent to a combined equalizer, 
matched filter, and sequential decoder. The neural network 
based matched filter is matched to spectrally efficient wave- 
forms that have passed through the nonlinear channel. This 
neural network enhanced modem provides receiver bit-er- 
ror-rate performance and data throughput not otherwise 
available on SATCOM links from any fielded communica- 
tions equipment. 

The figure-of-merit development which AAC is con- 
ducting under the ONR-sponsored Phase II SBIR contract, 
described earlier, provides a basis by which intelligent 
agents can evaluate situations. Embedded knowledge in 
intelligent agents will allow them to both “fill in” certain 
aspects of the strike plan (given certain information as 
starting points) and to identify what information needs to be
obtained to complete the revised plan. Intelligent agents can 
handle some of the query from the strike platforms back to 
the base, sending and receiving sparsely coded messages 
that rapidly “fill in the blanks” for the new plan. By shifting 
some plan completion tasks to agents, the strike aircrews 
will be able to concentrate their attention on response to the 
immediate environment. 

The combination of these and other technological ad- 
vances will change the nature of real-time retargeting from 
concept to reality. This will make possible the most effec- 
tive use of Navy air assets. 

Research, he was pursued research into novel forms of synap-
tic plasticity (neural network models for the distributed repre-

cipal components analysis called independent components
analysis that has widespread applications to underwater com-
munication, echo cancellation, image compression, and the
analysis of electroencephalographic recordings. The long-


A Digital VLSI
Architecture for Neural 
Network Emulation,
Pattern Recognition, 
and Image Processing* 


Dan Hammerstrom, Adaptive Solutions, Inc. 


Introduction 


As the other articles of this journal show, the neural 
network model has significant advantages over traditional 
models for certain applications. It has also expanded our 
understanding of biological neural networks by providing a 
theoretical foundation and a set of functional models.

Neural network simulation remains a computationally 
intensive activity, however. The underlying computations
(generally multiply-accumulates) are simple but numerous.
For example, in a simple artificial neural network (ANN)
model, most nodes are connected to most other nodes, leading
to O(n²) connections¹. A network with 100,000 nodes, modest
by biological standards, would therefore have about 10 billion
connections, in the simplest models, with a multiply-accumu-
late operation needed for each connection. If a state-of-the-art
workstation can simulate roughly 10 million connections per
second, then one pass through the network takes 1,000 seconds
(about 17 minutes). This data rate is much too slow for real-
time process control or speech recognition, which must update
several times a second. Clearly, we have a problem.
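
The arithmetic behind this estimate is easy to check. The short calculation below uses only the figures quoted in the paragraph above; the variable names are illustrative, and the simulation rate is the text's own rough assumption for a workstation of the period.

```python
nodes = 100_000                      # network size quoted in the text
connections = nodes ** 2             # fully connected model: O(n^2) connections
rate = 10_000_000                    # assumed connections simulated per second

seconds_per_pass = connections / rate
minutes_per_pass = seconds_per_pass / 60

print(f"{connections:.0e} connections")
print(f"{seconds_per_pass:.0f} s per pass ({minutes_per_pass:.1f} min)")
```

One pass therefore takes 1,000 seconds, a little under 17 minutes, against a requirement of several updates per second: a shortfall of well over three orders of magnitude.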

This performance bottleneck is worse if each connection 
requires more complex computations, for instance for incre-
mental learning algorithms or for more realistic biological
simulations. Eliminating this computational barrier has led to
much research into building custom Very Large Scale Integra-
tion (VLSI) silicon chips optimized for ANNs. Such chips
might perform ANN simulations hundreds to thousands of
times faster than workstations or personal computers, for
about the same cost.


* This paper is adapted from a chapter, “A Digital VLSI Architecture for Real-World Applications,” Dan Hammerstrom, in An Introduction to
Neural and Electronic Networks - Second Edition, pp. 335-358, Eds S.F. Zornetzer, J.L. Davis, C. Lau, and T. McKenna, Academic Press, 1995.
¹ The “order of” O(F(n)) notation means that the quantity represented by O is approximated by the function F to within multiplication or division
by a constant.

The research into VLSI chips for neural network and 
pattern recognition applications is based on the premise that 
optimizing the chip architecture to the computational charac- 
teristics of the problem lets the designer create a silicon device 
offering a big improvement in performance/cost or “opera- 
tions per dollar.” In silicon design, the cost of a chip is 
primarily determined by its two-dimensional area. Smaller 
chips are cheaper chips. Within a chip, the cost of an operation 
is roughly determined by the silicon area needed to implement 
it. Furthermore, speed and cost usually have an inverse rela- 
tionship: faster chips are generally bigger chips. 

The silicon designer’s goal is to increase the number of 
operations per unit area of silicon, called functional density, in 
turn increasing operations per dollar. An advantage of ANNs 
is that they employ simple, low-precision operations requiring 
little silicon area. As a result, chips designed for ANN emula- 
tion can have a higher functional density than traditional chips 
such as microprocessors. The motive for developing special- 
ized chips, whether analog or digital, is this potential to im- 
prove performance, reduce cost, or both. 

The designer of ANN silicon faces many other choices 
and trade-offs. One of the most important is flexibility versus 
speed. At the “specialized” end of the flexibility spectrum, the 
designer gives up versatility for speed to make a fast chip 
dedicated to one task. At the “general purpose” end, the sacri-
fice is reversed, yielding a slower, but programmable device.
The choice is difficult because both traits are desirable. Real- 
world neural network applications ultimately need chips 
across the entire spectrum. 

This paper reviews one such architecture, CNAPS² (Con-
nected Network of Adaptive ProcessorS), developed by Adap-
tive Solutions, Inc. This architecture was designed for ANN
simulation, image processing, and pattern recognition. To be 
useful in these related contexts, it occupies a point near the 
“general purpose” end of the flexibility spectrum. We believe 
that, for its intended markets, the CNAPS architecture has the 
right combination of speed and flexibility. 

This paper is divided into two major sections, each framed 
in terms of the capabilities needed in the CNAPS computer’s 
target markets. The first section presents an overview of the 
CNAPS architecture and offers a rationale for its major design 
decisions. It also summarizes the architecture’s limitations and 
describes aspects that, in hindsight, its designers might have
changed. The section ends with a brief discussion of the
CNAPS program development software.

The second section briefly reviews applications devel-
oped for CNAPS at this writing³. The applications discussed
are simple image processing, automatic target recognition, a
simulation of the Lynch/Granger Pyriform Model, Kanji OCR,
Adobe PhotoShop acceleration, and medical image process-
ing.


² Trademark Adaptive Solutions, Inc.
³ Because ANNs are becoming a key technology, many customers consider their use of ANNs to be proprietary information. Many applications
are not yet public knowledge.


The CNAPS Architecture 


The CNAPS architecture consists of an array of proces- 
sors controlled by a sequencer, both implemented in a chip set 
developed by Adaptive Solutions, Inc. The sequencer is a 
one-chip device, called the CNAPS Sequencer Chip (CSC). 
The processor array is also a one-chip device, available with 
either 64 or 16 processors per chip (the CNAPS-1064 or 
CNAPS-1016). The CSC can control up to eight 1064s or 
1016s, which act like one larger device. 

These chips usually sit on a printed circuit board that plugs 
into a host computer, also called the control processor (CP). 
The CNAPS board acts as a coprocessor within the host. Under 
the coprocessor model, the host sends data and programs to 
the board, which runs until done, then interrupts the host to 
indicate completion. This style of operation is called “run to
completion semantics.” Another possible model is to use the
CNAPS board as a stand-alone device to process data continu-
ously.


Figure 1.


The basic CNAPS Architecture. CNAPS is a single instruction
multiple data (SIMD) architecture that uses broadcast input,
one-dimensional inter-processor communication, and a sin-
gle, shared output bus.




Basic Structure 


CNAPS is a single instruction, multiple data stream 
(SIMD) architecture. SIMD computers have one instruction 
sequencing/control unit and many processor nodes (PNs). In 
CNAPS, the PNs are connected in a one-dimensional array 
(Figure 1) where each PN can “talk” only to its right or left 
neighbors. The sequencer broadcasts each instruction plus 
input data to all PNs, which execute the same instruction at 
each clock. The PNs transmit output data to the sequencer, with 
several arbitration modes controlling access to the output bus. 

As Figure 2 suggests, each PN has a local memory⁴, a
multiplier, an adder/subtractor, a shifter/logic unit, a register
file⁵, and a memory addressing unit. The entire PN uses
fixed-point, two’s complement arithmetic, and the precision is 
16 bits with some exceptions. The PN memory can handle 8- 
or 16-bit reads or writes. The multiplier produces a 24-bit 
output; an 8x16 or 8x8 multiply takes one clock, and a 16x16 
multiply takes two clocks. The adder can switch between 16- 
or 32-bit modes. The input and output buses are 8 bits wide, 
and a 16-bit word can be assembled (or disassembled) from 
two bytes in two clocks. 

A PN has several additional features [6, 7], including
a function that finds the PN with the largest or smallest values 
(useful for winner-take-all and best-match operations), various 
precision and memory control features, and OutBus arbitra- 
tion. These features are too detailed to discuss fully here. 

The CSC sequencer (Figure 3) performs program se- 
quencing for the PN array and has private access to a program 
memory. The CSC also performs I/O processing for the array, 
writing input data to the array and reading output data from it. 
To move data to and from CP memory, the CSC has a 32-bit 
bus, called the AdaptBus, on the CP side. The CSC also has a 
direct input port and a direct output port used to connect the 
CSC directly to I/O devices for higher-bandwidth data move- 
ment. 


Neural Network Example 


The CNAPS architecture can execute many ANN and 
non-ANN algorithms. Many SIMD techniques are the same in 
both contexts, so an ANN can serve as a general example of 
mapping an algorithm to the array. Specifically, the example 
shows how the PN array simulates a layer in an ANN. 


4 Currently 4 KB per PN. 
5 Currently 32 16-bit registers. 





Figure 2. 


The internal structure of a CNAPS processor node (PN). Each 
PN has its own storage and arithmetic capabilities. Storage 
consists of 4,096 bytes. Arithmetic operations include multiply, 
accumulate, logic, and shift. All units are interconnected by 
two buses. 
Start by assuming a two-layered network (Figure 4) where, 
for simplicity, each node in each layer maps to one PN. PN_i 
thus simulates the node n_{i,j}, where i is the node index in the 
layer and j is the layer index. Layers are simulated in a 
time-multiplexed manner: all layer 1 nodes execute as a 
block, then all layer 2 nodes, and so on. Finally, assume that layer 1 
has already calculated its various n_{i,1} outputs. 

The goal at this point is to calculate the outputs for layer 
2. To achieve this, all layer 1 PNs simultaneously load their 
output values into a special output buffer and begin arbitrating 
for the output bus. In this case, the arbitration mode lets each 
PN transmit its output in sequence. In one clock, the content 
of PN_0’s buffer is placed on the output bus and goes through 
the sequencer⁶ to the input bus. From the input bus, the value 
is broadcast to all PNs (this out-to-in loopback feature is a key 
to implementing layered structures efficiently). Each PN then 
multiplies node n_{0,1}’s output with a locally stored weight, w_0. 

On the next clock, node n_{1,1}’s output is broadcast to all 
PNs, and so on for the remaining layer 1 output values. After 
N clocks, all outputs have been broadcast, and the inner 
product computation is complete. All PNs then use the 
accumulated value’s upper 8 bits to look up an 8-bit non-linear 
output value in a 256-item table stored in each PN’s local 
memory. This process (calculating a weighted sum, then 
passing it through a function stored in a table) is performed 
for each output on each layer. The last layer transmits its output 
values through the CSC to an output buffer in the CP memory. 

6 This operation actually takes several clocks and must be pipelined. These details are omitted here for clarity. 

The multiply-accumulate pipeline can compute a connec- 
tion in each clock. The example network has four nodes and 
uses only four clocks for its 16 connections. For even greater 
efficiency, other operations can be performed in the same clock 
as the multiply-accumulate. The separate memory address 
unit, for instance, can compute the next weight’s address at the 
same time as the connection computation. And the local mem- 
ory allows the weight to be fetched without delay. 
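
The broadcast-and-accumulate scheme described above can be sketched in a few lines of Python. This is only an illustrative model, not CNAPS code: the function name, the data layout, and the unsigned 16-bit accumulator are assumptions made for the sketch (the real PN uses two's complement arithmetic and a pipelined bus).

```python
# Illustrative model of one layer on a broadcast SIMD array.
# All names here are hypothetical; they are not CNAPS identifiers.

def simulate_layer(inputs, weights, table):
    """inputs: values broadcast one per clock.
    weights[pn][slot]: weight held locally by PN `pn` for broadcast slot `slot`.
    table: 256-entry lookup table standing in for the non-linear function."""
    n_pns = len(weights)
    acc = [0] * n_pns                      # one accumulator per PN
    clocks = 0
    for slot, value in enumerate(inputs):  # one broadcast per clock
        for pn in range(n_pns):            # conceptually simultaneous on all PNs
            acc[pn] += weights[pn][slot] * value
        clocks += 1
    # Assumption: the sum fits in 16 bits, so bits 15..8 index the table.
    outputs = [table[(a >> 8) & 0xFF] for a in acc]
    return outputs, clocks


identity_table = list(range(256))
four_by_four = [[16] * 4 for _ in range(4)]   # 4 PNs, 4 weights each
result, clocks_used = simulate_layer([16, 32, 48, 64], four_by_four, identity_table)
print(result, clocks_used)
```

With four PNs and four inputs the loop runs four broadcast steps, matching the four clocks for 16 connections quoted above.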

An array of 256 PNs can compute 256^2 = 65,536 connections 
in 256 clocks. At a 25 MHz clock frequency, this equals 
6.4 billion connections per second (back-propagation feed- 
forward) and over 1 billion connection updates per second 
(back-propagation learning). An array of 64 PNs (one CNAPS- 
1064 chip), for example, can store and train the entire NetTalk 
[18] network in about 7 seconds. 
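
The feed-forward figure follows from simple arithmetic, shown here as a small check. The function names are illustrative, and the model assumes every PN completes exactly one connection per clock with no pipeline or I/O overhead.

```python
def connections_per_second(num_pns, clock_hz):
    # Assumption: each PN completes one multiply-accumulate per clock.
    return num_pns * clock_hz


def clocks_for_full_layer(num_pns):
    # n*n connections spread over n PNs, one connection per PN per clock.
    return (num_pns * num_pns) // num_pns


print(connections_per_second(256, 25_000_000))  # 6400000000
print(clocks_for_full_layer(256))               # 256
```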





Figure 3. 


The CNAPS sequencer chip (CSC) internal structure. The CSC 
accesses an external program store, which contains both CSC 
and CNAPS PN array instructions. PN array instructions are 
broadcast to all PNs. CSC instructions control sequencing and 
all array input and output. 
Physical Implementation 


The CNAPS PN array has been implemented in two chips, 
one with 64 PNs (the CNAPS-1064[4]) (Figure 5) and the 
other with 16 PNs (the CNAPS-1016). Both chips are imple- 
mented in a 0.8 micron CMOS process. The 64 PN chip is a 
full custom design and is about 26 millimeters on a side and 
has over 14 million transistors, making it one of the largest 
processor chips ever made. The simple computational model 
makes possible a small, simple PN, in turn permitting the use 
of redundancy to improve semiconductor yield for such a 
device. 

The CSC is implemented in gate array technology 
on a 100,000-gate die about 10 millimeters on a side. 

The next section reviews the various design decisions and 
the reasons for making them. Some of the features described 
are unique to CNAPS; others apply to any Digital Signal 
Processor chip. 


Major Design Decisions 


When designing the CNAPS architecture, a key question 
was where it should sit relative to other computing devices in 
cost and capabilities. In computer design, flexibility and per- 
formance are almost always inversely related. We wanted 
CNAPS to be flexible enough to run a broad family of ANN 
algorithms as well as other related pattern recognition and 
preprocessing algorithms. Yet we wanted it to have much 
higher performance than state-of-the-art workstations and — at 
the same time — lower cost for its functions. 

Figure 6 shows where we are targeting CNAPS. The 
vertical dimension plots each architecture by its flexibility. 
Flexibility is difficult to quantify, since it involves not only the 
range of algorithms that an architecture can execute, but also 
the complexity of the problems it can solve. (Greater complex- 
ity typically requires a larger range of operations.) As a result, 
this graph is subjective and provided only as an illustration. 

The horizontal dimension plots each architecture by its 
performance/cost — or operations per dollar. The values are 
expressed in a log scale due to the orders-of-magnitude differ- 
ence between traditional microprocessors at the low end and 
highly custom analog chips at the high end. Note the 
technology barrier, defined by practical limits of semiconductor 
technology. No one can build past the barrier, since you can 
do only so much with a transistor; you can put only so many 
of them on a chip; and you can run them only so fast. 

For ANN emulation, we wanted to place the CNAPS 
architecture in the middle, between the specialized analog 
chips and the general purpose microprocessors. We wanted it 
to be programmable enough to solve many real-world problems, 
and yet have a performance/cost about 100 times better 
than the highest performance RISC processors. 

In determining the degree of function required, we must 
solve all or most of a targeted problem. This need results from 
Amdahl’s Law, which states that system performance depends 
mainly on the slowest component. This law can be formalized 
as follows: 

$$S = \frac{1}{(op_f / s_f) + op_h}$$

where S is the total system speed-up, op_f is the fraction of 
total operations in the part of the computation run on the 
fast chip, s_f is the speed-up the chip provides, and op_h is the 
fraction of total operations run on a host computer without 
acceleration (so op_f + op_h = 1). Hence as s_f gets large, S approaches 1/op_h. 
Unfortunately, op_f needs to be close to one before any real 
system-level improvement occurs, as shown in the following 
example. 

Suppose there are two such support chips to choose from: 
the first can run 80% of the computation with 20x improve- 
ment on that 80%; the second can run only 20%, but runs that 
20% 1000x faster. By Amdahl’s law, the first chip speeds up 
the system by a factor of about 4.2, while the second, seemingly 
faster chip speeds it up by a factor of only about 1.25. Amdahl tells 
us, therefore, that flexibility is often better than raw perform- 
ance, especially if that performance results from limiting the 
range of operations performed by the device. 
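
The two-chip comparison can be checked numerically. The sketch below assumes the usual form of Amdahl's law, with the host fraction taken as one minus the accelerated fraction; the function name is illustrative.

```python
def amdahl_speedup(op_f, s_f):
    """Total speed-up when a fraction op_f of the operations runs s_f times
    faster and the remaining fraction (1 - op_f) runs at the original speed."""
    op_h = 1.0 - op_f
    return 1.0 / (op_f / s_f + op_h)


flexible_chip = amdahl_speedup(0.80, 20)     # runs 80% of the work, 20x faster
narrow_chip = amdahl_speedup(0.20, 1000)     # runs 20% of the work, 1000x faster
print(round(flexible_chip, 2), round(narrow_chip, 2))  # 4.17 1.25
```

A speed-up of 1.25 corresponds to a run time reduced by 20%, which is the most the second chip can offer however fast it runs its share.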


Digital 

Much effort has been dedicated to building analog VLSI 
chips for ANNs. Analog chips have great appeal, partly be- 
cause they follow biological models more closely than digital 
chips. Analog chips also can achieve higher functional density. 
Excellent papers reporting research in this area include [12], 
[1], [5], [10], and [2]. See also Morgan [15] for a good 
summary of digital neural network emulation. 

Analog ANN implementations have been primarily aca- 
demic or industrial research projects, however. Only a few 
have found their way into the real world as commercial prod- 
ucts: getting an analog device to work in a laboratory is one 
thing; making it work over a wide range of voltages, tempera- 
tures, and user capabilities is another. In general, analog chips 
require much more stringent operating conditions than digital 
chips. They are also more difficult to design and, after imple- 
mentation, less flexible. 

The semiconductor industry is heavily oriented toward 
digital chips. Analog chips represent only a minor part of the 
total output, reinforcing their secondary position. There are, of 
course, successful analog parts, and there always will be, since 
some applications require analog’s higher functional density 
to achieve their cost and performance constraints, and those 
applications can tolerate analog’s limited flexibility. Likewise, 
there will be successful products using analog ANN chips. 
Analog parts will probably be used in simple applications or 
as a part of a larger system in more complex applications, 
however. 

This prediction follows primarily from their limited flexi- 
bility. Analog chips typically implement one algorithm hard- 
wired into the chip. A hardwired algorithm is fine if truly 
stable. The field of ANN applications is still new, however, so 
most complex implementations are still actively evolving — 
even at the algorithm level. An analog device cannot easily 
follow such changes. A digital, programmable device can 
change algorithms by changing software. 

Our major goal was to produce a commercial product that 
would be flexible enough and provide sufficient precision to 
cover a broad range of complex problems. This goal dictated 
a digital design, since digital could offer better precision and 
much more flexibility than a typical CMOS analog implemen- 
tation. Digital also offered excellent performance and all the 
advantages of a standardized technology. 





Figure 4. 


A simple two-layered neural network. In this example each PN 
emulates two network nodes. PNs emulate the first layer, 
computing one connection each clock. They then sequentially 
place node output on the OutBus while emulating, in parallel, 
the second layer. 
Limited, Fixed-Point Precision 


In both analog and digital domains, an important decision 
is choosing the arithmetic precision required. In analog, precision 
affects design complexity and the amount of compensation 
circuitry required. In digital, it affects the number of 
wires available as well as the size and complexity of memory, 
buses, and arithmetic units. Precision also affects the power 
dissipation. 

In the digital domain, a related decision involves floating- 
point versus fixed-point representation. Floating-point num- 
bers (Figure 7) consist of an exponent (usually 8 bits 
representing base 2 or base 16) and a mantissa (usually 24 
bits). The exponent is set so that the mantissa is always 
normalized — that is, the most significant one is in the most 
significant position. Adding two floating-point numbers in- 
volves shifting at least one of the operands to get the same 
exponent. Multiplying two floating-point numbers involves 
separate arithmetic on both exponents and mantissas. Both 
operations require normalization shifting after 
the arithmetic. 

Floating point has several advantages. The primary ad- 
vantage is dynamic range, which results from the separate 
exponent. Another is precision, due to the 24-bit mantissas. 
The disadvantage to floating point is its cost in silicon area. 
Much circuitry is required to keep track of both exponents and 
mantissas and perform pre- and post-operation shifting of the 
mantissa. This circuitry is particularly complicated if high 
speed is required. 





Figure 5. 


The CNAPS PN array chip. There are 64 PNs with memory on 
each die. The PN array chip is one of the largest processor 
chips ever made. It consists of 14 million transistors and is over 
26 millimeters on a side. PN redundancy (there are 16 spare 
PNs) is used to guarantee high yields. 
Fixed-point numbers consist of a numeral (usually 16 to 
32 bits) and a radix point (in base two, the binary point). In 
fixed point, the programmer chooses the position of the radix 
point. This position is typically fixed for the calculation, 
although it is possible to change the radix point under software 
control by explicitly shifting operands. For many applications 
needing only limited dynamic range and precision, fixed point 
is sufficient. It is also much cheaper than floating point because 
it requires less silicon area. 
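
As a concrete illustration of a programmer-chosen radix point, the sketch below models a 16-bit format with 8 integer and 8 fraction bits. The format choice, the helper names, and the absence of overflow handling are assumptions made for the example; it is not a model of the PN datapath.

```python
FRAC_BITS = 8  # assumed position of the binary point (an 8.8 format)


def to_fixed(x):
    # Scale by 2**FRAC_BITS and round to the nearest integer code.
    return int(round(x * (1 << FRAC_BITS)))


def to_float(q):
    return q / (1 << FRAC_BITS)


def fixed_add(a, b):
    # Operands share the same binary point, so addition needs no shift.
    return a + b


def fixed_mul(a, b):
    # The raw product carries 2 * FRAC_BITS fraction bits; an explicit
    # shift moves the binary point back to FRAC_BITS.
    return (a * b) >> FRAC_BITS


a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(fixed_mul(a, b)), to_float(fixed_add(a, b)))  # 3.375 3.75
```

The explicit shift in the multiply is the software-controlled radix adjustment mentioned above; a floating-point unit would perform the equivalent normalization in hardware.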

After choosing a digital signal representation for CNAPS, 
the next question was how to represent the numbers. Biologi- 
cal neurons are known to use relatively low precision and to 
have a limited dynamic range. These characteristics strongly 
suggest that a digital computer for emulating ANN structures 
should be able to employ limited precision fixed-point arith- 
metic. This conjecture in turn suggests an opportunity to 
significantly simplify the arithmetic units and to provide 
greater computational density. Fixed-point arithmetic also 
places the design near the desired point on the flexibility versus 
performance/cost curve (Figure 6). 

To confirm the supposition that fixed-point is adequate, 
we performed extensive simulations. We found that for the 
target applications, 8- or 16-bit fixed-point precision was 
sufficient [3]. Other researchers have since reached the same 
conclusion ([9] and [19]). In keeping with experimental results, 
we used a general 16-bit resolution inside the PN. One excep- 
tion was using a 32-bit adder to provide additional head room 
for repeated multiply-accumulates. Another was using 8-bit 
input and output data buses, since most computations involve 
8-bit data and 8- or 16-bit weights, and since busing external 
to the PN is expensive in silicon and board area. Using 16 bits 
for the buses internal to the PN did not add that much extra 
area. 


SIMD 


The next major decision was how to control the PNs. A 
computer can have one or more instruction streams and one or 
more data streams. Most computers are single instruction, 
single data (SISD) computers. These have one control unit and 
one processor unit, usually combined on one chip (a micro- 
processor). The control unit fetches instructions from program 
memory and decodes them. It then sends data operations such 
as add, subtract, or multiply to the processing unit. Sequencing 
operations, such as branch, are executed by the control unit 
itself. SISD computers are serial, not parallel. 

Two major families of parallel computer architectures 
have evolved: multiple instruction, multiple data (MIMD) and 
single instruction, multiple data (SIMD). MIMD computers 
have many processing units, each of which has its own control 
unit. Each control/processing unit can operate in parallel, 
executing many instructions at once. Since the processors 
operate independently, MIMD is the most powerful and flexible 
parallel architecture. The independent, asynchronous 
processors also make MIMD the most difficult to use, requiring 
complex processor synchronization. 

SIMD computers have many processors but only one 
instruction stream. All processors receive the same instruction 
at the same time, but each acts on its own slice of the data. 
SIMD computers thus have an array of processors and can 
perform an operation on a block of data in one step. SIMD 
computing is often called “data parallel” computing, since it 
applies one control thread to multiple, local data elements, 
executing one instruction at each clock. 

SIMD computation is perfect for vector and matrix arith- 
metic. Due to Amdahl’s law, however, SIMD is cost effective 
only if most operations are matrix or vector operations. For 
general-purpose computing, that is not the case. Consequently, 
SIMD machines are poor general-purpose computers and rarer 
than SISD or even MIMD computers. Our target domain is not 
general-purpose computing, however. For ANNs and for other 
image processing, pattern recognition, and signal processing 
algorithms, the dominant calculations are vector or matrix 
operations. SIMD fits this domain perfectly. 





Figure 6. 


Though subjective, this graph gives a rough indication of the 
CNAPS market positioning. The vertical dimension measures 
the range of functionality of an architecture; the horizontal 
dimension, the performance/cost in operations per dollar. The 
philosophy behind CNAPS is that by restricting functionality to 
pattern recognition, image processing, and neural network 
emulation, a larger performance/cost is possible than with 
traditional machines (parallel or sequential). 
SIMD is a good choice for practical reasons too. One 
advantage is cost: SIMD is much cheaper than MIMD, since 
there is only one control unit for the entire array of processors. 
Another is that SIMD is easier to program than MIMD, since 
all processors do the same thing at the same time. Likewise, it 
is easier to develop computer languages for SIMD, since it is 
relatively easy to create parallel data structures where the data 
are operated on simultaneously. Figure 8 shows a simple 
CNAPS-C program that multiplies a vector times a matrix. 
Normally, vector matrix multiply takes n^2 operations. By 
placing one column of the matrix on each PN, it takes n 
operations on n processors. 

In sum, SIMD was better than MIMD for CNAPS because 
it fit the problem domain, was much more economical, and 
was easier to program. 


Broadcast Interconnect 


The next decision concerned how to interconnect the PNs 
for data transfer both within the array and outside it. Computer 
architects have developed several interconnect structures for 
connecting processors in multiprocessor systems. Since 
CNAPS is a SIMD machine, we were interested only in 
synchronous structures. 

The two families of interconnect are local and global. 
Local interconnect attaches only neighboring PNs. The most 
common local scheme is NEWS (North-East-West-South — 
Figure 9). In NEWS, the PNs are laid out in a two-dimensional 
array, and each PN is connected to its four nearest neighbors. 
A one-dimensional variation connects each PN only to its left 
and right neighbors. 

Global interconnect permits any PN to talk to any other 
PN, not just to its immediate neighbors. There are several 
possible configurations with different levels of performance/cost. 
At one end of the scale, crossbar interconnect is 
versatile, since it permits random point-to-point communications, 
but expensive (the cost is O(n^2), where n is the number 
of PNs). At the other end, broadcast interconnect is cheaper 
but less flexible. Here, one bus connects all PNs, so any one 
PN can talk to any other (or set of others) in one clock. On the 
other hand, it takes n clocks for all PNs to have a turn. The cost 
is O(1). In between crossbar and broadcast are other configurations 
that can emulate a crossbar in O(log n) clocks and have 
cost O(n log n). 

Choosing an interconnect structure interacted with other 
design choices. We reached a crossroads by deciding against 
using a systolic computing style, where operands, intermediate 
results, or both flow down a row of PNs only using local 
interconnect. Systolic arrays are harder to program. They are 
also occasionally inefficient due to the clocks needed to fill or 
empty the pipeline — peak efficiency occurs only when all PNs 
see all operands. Choosing a systolic array would have permit- 
ted us to use local interconnect, saving cost. Deciding against 
it forced us to provide some form of global interconnect. 
Choosing “global” leads to the next choice: what type? 
The basic computations in our target applications require 
“one-to-many” or “many-to-many” communication almost 
exclusively. We therefore decided to use a broadcast bus, 
which uses only one clock for one-to-many communication. 
In the many-to-many case, n PNs can talk to all n PNs in n 
clocks. Broadcast interconnect thus allows n^2 connections in 
n clocks. Such O(n^2) total connectivity occurs often in ANN 
models. An example is a back-propagation network in which 
all nodes in one layer connect to all nodes in the next. 

Another advantage is that broadcast interconnection is 
synchronous and fits the synchronous SIMD structure quite 
well. We were able to use a “slotted” protocol, where each 
connection occurs at a known time on the bus. Since the time 
is known, there is no need to send an address with each data 
element, saving wires, clocks, or both. Also, the weight ad- 
dress unit can “remember” the slot number and use it to address 
the weight associated with the connection. 

A single broadcast bus is simple, economical to imple- 
ment, and efficient for the application domain. In fact, if every 
PN always communicates with every other PN, then broadcast 
offers the best possible performance/cost. 

Broadcast interconnection does have some drawbacks. 
One problem is its inefficiency for some point-to-point communication 
patterns, where one PN talks with another PN 
anywhere in the array. An example of such a pattern is the 
“perfect shuffle” used by the fast Fourier transform (FFT) 
(Figure 10). This pattern takes n clocks on the CNAPS broadcast 
bus and is too slow to be effective. Consequently, CNAPS 
implements the compute-intensive discrete Fourier transform 
(DFT) instead of the communication-intensive FFT. The DFT 
requires O(n^2) operations; the FFT, O(n log n). If n = p, where 
p is the number of PNs, then CNAPS can perform a DFT in 
O(n) clocks, however. If log n = p, then performance can 
approach the O(n log n) of a sequential processor. 
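
The n = p case can be illustrated with a small model in which each PN owns one output bin and every input sample is broadcast once. This is a sketch under stated assumptions: it uses Python complex floats for clarity, whereas a fixed-point array would presumably hold precomputed coefficients in PN memory, and the function name is hypothetical.

```python
import cmath


def broadcast_dft(samples):
    """DFT computed as n broadcast steps, assuming one PN per output bin."""
    n = len(samples)
    acc = [0j] * n                      # accumulator of PN k holds bin k
    clocks = 0
    for slot, x in enumerate(samples):  # one sample on the bus per clock
        for k in range(n):              # conceptually simultaneous on all PNs
            acc[k] += x * cmath.exp(-2j * cmath.pi * k * slot / n)
        clocks += 1
    return acc, clocks


bins, clocks_used = broadcast_dft([1, 1, 1, 1])
print([round(abs(b), 6) for b in bins], clocks_used)
```

Each broadcast step performs n multiply-accumulates in parallel, so the n^2 operations of the DFT complete in n steps without any point-to-point traffic.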

Another problem involves computation localized in a 
portion of an input vector, where each PN operates on a 
different (possibly overlapping) subset of the elements. Here, 
all PNs must wait for all inputs to be broadcast before any 
computation can begin. A common example of this situation 
is the limited receptive field structure, often found in image 
classification and character recognition networks. The convo- 
lution operation, also common in image processing, uses simi- 
lar localized computation. The convolution can proceed 
rapidly after some portion of the image has been input into 
each PN, since each PN operates independently on its subset 
of the image. 

When these subfields overlap (as in convolution), a PN 
must communicate with its neighbors. To improve performance 
for such cases, we added a one-dimensional inter-PN 
pathway, connecting each PN to its right and left neighbors. 
(One dimension was chosen over two to allow processor 
redundancy, discussed further below.) The CNAPS array 
therefore has both global (broadcast) and local (inter-PN) 
interconnection. An example of using the inter-PN pathway 
is image processing, where each column of an image is 
allocated to a PN. The inter-PN pathway permits efficient 
communication between columns and, consequently, efficient 
computation of most image processing algorithms. 
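
A neighbor-only exchange of this kind can be sketched as a three-tap convolution in which each PN holds one value and reads one value from each side. The zero padding at the array ends and all names are assumptions of this sketch, not documented CNAPS behavior.

```python
def neighbor_convolve(values, kernel):
    """values[i] is the datum held by PN i; kernel = (left, center, right)."""
    k_left, k_center, k_right = kernel
    # Each PN receives its left and right neighbor's value over the
    # local pathway; the end PNs are assumed to receive zero.
    from_left = [0] + values[:-1]
    from_right = values[1:] + [0]
    return [
        k_left * l + k_center * c + k_right * r
        for l, c, r in zip(from_left, values, from_right)
    ]


print(neighbor_convolve([1, 2, 3, 4], (1, 1, 1)))  # [3, 6, 9, 7]
```

Both shifts involve every PN at once, so the exchange costs a fixed number of steps regardless of array size, where a broadcast of all values would take n clocks.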

A final problem is sparse random interconnect, where 
each node connects to some random subset of other nodes. 
Broadcast is, from the viewpoint of the connected PNs, effi- 
cient in this case. Nonetheless, when a sparse connectivity is 
used with a slotted protocol, many PNs are idle, since they lack 
weights connected to most inputs and cannot use most of the 
data being broadcast. Sparse interconnect affects all aspects of 
the architecture, not just data communication. To improve 
efficiency for sparsely-connected networks, the CNAPS PN 
offers a special memory technique called virtual zero, which 
saves the memory locations that would otherwise be filled with 
zeros by not loading zeros into memory for unused connections. 
The virtual zero technique does not help the idle PN 
problem, however. Full efficiency with sparse interconnect 
requires a much more complex architecture, including more 
individualized control per PN and more complex memory 
referencing capabilities; its discussion is beyond the scope 
of this paper. 





Figure 7. 


A floating point number. A single precision, IEEE compatible 
floating point configuration is shown. The high order 8 bits 
constitute the exponent; the remaining 24 bits, the mantissa or 
“fractional” part. Floating point numbers are usually normalized 
so that the mantissa has a 1 in the most significant position. 
7 For most implementations the bit rate per pin is roughly equal to the clock rate, which can vary anywhere from 25 to 200 MHz. There are some 
special interface protocols which now allow up to 500 megabits per second per pin, but power dissipation limits how many bits can be sent off chip at 
those frequencies. 
Figure 8. 


A CNAPS-C program to do a simple vector-matrix multiply. The 
“data-parallel” programming is evident here. Within the loop, it 
is assumed, because of the domain declaration, that there are 
multiple copies of each matrix element, one on each PN. The 
program takes N loop iterations, which would require N^2 on a 
sequential machine. 





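#define N 20
#define K 30

typedef scaled 8 8 arithType;

/* One domain element per PN: N matrix elements and one result. */
domain Krows
{
    arithType sourceMatrix[N];
    arithType resultVector;
} dimK[K];

main()
{
    int n;
    /* The block below is executed by every PN in the domain. */
    [domain dimK].{
        resultVector = 0;
        for (n = 0; n < N; n++)
            resultVector += sourceMatrix[n] * getchar();
    }
}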





On-Chip Memory 


One of the most difficult decisions was whether to place 
the local memory on-chip (inside the PN) or off-chip. Both 
approaches have advantages and drawbacks. It was a complex 
decision with no obvious right answer and little opportunity 
for compromise. 

The major advantage of off-chip memory is that it allows 
essentially unlimited memory per PN. Placing memory inside 
the PN, in contrast, limits the available memory because 
memory takes significant silicon area. Increasing PN size also 
limits the number of PNs. Another advantage to off-chip 
memory is that it allows the use of relatively low-cost, 
commercial memory chips. On-chip memory, in contrast, increases 
the cost per bit, even if the memory employs a commercial 
memory cell. 

8 CNAPS-C is a data parallel version of the standard C language. 

The major advantage of on-chip memory is that it allows 
much higher bandwidth for memory access. To see bandwidth 
as a crucial factor, consider the following analysis. Recall that 
each PN has its own data arithmetic units, so each PN requires 
a unique memory data stream. The CNAPS-1064 has 64 PNs, 
each potentially requiring up to two bytes per clock. At 25 
MHz, that is 25M * 64 * 2 = 3.2 billion bytes per second. 
Attaining 3.2 billion bytes per second from off-chip memory 
is difficult and expensive due to limits on the number of pins 
per chip and the data rate per pin. An option would be to reduce 
the number of PNs per chip, eroding the benefit of maximum 
parallelism. 
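
The bandwidth figure, and the pin count it would imply off-chip, can be worked through directly. The pin estimate assumes one bit per pin per clock, in line with the rule of thumb that the bit rate per pin roughly equals the clock rate; the function names are illustrative.

```python
def required_bytes_per_second(num_pns, bytes_per_clock, clock_hz):
    return num_pns * bytes_per_clock * clock_hz


def data_pins_needed(bytes_per_second, bits_per_pin_per_second):
    # Assumption: every data pin carries a fixed number of bits per second.
    return (bytes_per_second * 8) // bits_per_pin_per_second


bandwidth = required_bytes_per_second(64, 2, 25_000_000)
print(bandwidth)                                  # 3200000000
print(data_pins_needed(bandwidth, 25_000_000))    # 1024
```

Under these assumptions, feeding 64 PNs from external memory would take on the order of a thousand data pins before any address or control pins are counted.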

Another advantage to on-chip memory is that each PN can 
address different locations in memory each clock. Systems 
with off-chip memory, in contrast, typically require all PNs to 
address the same location for each memory reference to reduce 
the number of external output pins. With a shared address only 
a single set of address pins is required for an entire PN array. 
Allowing each PN to have unique memory addresses requires 
a set of address pins for each PN, which is expensive. Yet 
having each PN address its own local memory improves 
versatility and speed, since table lookup, string operations, and 
other kinds of “indirect” reference are possible. 

Yet another advantage is that the total system is simpler. 
On-chip memory makes it possible to create a complete system 
with little more than one sequencer chip, one PN array chip, 
and some external RAM or ROM for the CSC program. 
(Program memory needs less bandwidth than PN memory 
because SIMD machines access it serially, one instruction per 
clock.) 

It is possible to place a cache in each PN and use off-chip 
memory as a backing store, an attempt to gain the benefits 
of both on-chip and off-chip memory. Our simulations on 
this point verified what most 
people who work in ANNs already suspected: caching is 
ineffective for ANNs due to the non-locality of the memory 
reference streams. Caches are effective if the processor 
repeatedly accesses a small set of memory locations, called a 
working set. ANNs rarely exhibit that kind of behavior; in- 
stead, they reference long, sequential vector arrays (generally 
weights). 

Separate PN memory addressing also reduces the benefit 
of caching. Unless all PNs refer to the same address, some PNs 
can have a cache miss and others not. If the probability of a 
cache miss is 10% per PN, then a 256 PN array will most likely 
have a cache miss every clock. But due to the synchronous 
SIMD control, all PNs must wait for the one or more PNs that 
miss the cache. This behavior renders the cache useless. A 
MIMD structure overcomes the problem but increases system 
complexity and cost. 
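
The "most likely every clock" claim can be quantified with a short calculation. It assumes that misses on different PNs are statistically independent, which is a simplification introduced for this sketch.

```python
def prob_any_miss(miss_rate_per_pn, num_pns):
    # Probability that at least one of num_pns independent PNs misses.
    return 1.0 - (1.0 - miss_rate_per_pn) ** num_pns


print(prob_any_miss(0.10, 256))
print(prob_any_miss(0.10, 1))
```

For 256 PNs the result is indistinguishable from 1, so under lockstep control the array would stall on nearly every clock even though each individual PN hits its cache 90% of the time.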

As this discussion suggests, local PN memory is a com- 
plex topic with no easy answers. Primarily due to bandwidth 
needs and the access to a commercial density static RAM 
CMOS process, we decided to implement PN memory on chip, 
inside the PN. Each PN has 4 KB in the current 1064 and 1016 
chips. 

CNAPS is the only architecture for ANN applications we 
are aware of that uses on-chip memory. Several designs have 
been proposed that use off-chip memory. The CNS system 
being developed at Berkeley [21], for instance, restricts the 
number of PNs to 16 per chip. It also uses a special high-speed 
PN-to-memory bus to achieve the necessary bandwidth. Another 
system, developed by Ramacher and others at Siemens [16], 
uses a special systolic pipeline that reduces the number 
of fetches required by forcing each memory fetch to be used 
several times. This organization is efficient at doing inner 
products, but has restricted flexibility. HNC has also created a 
SIMD array called the SNAP [14]. It uses floating-point arith- 
metic, reducing the number of PNs on a chip to only four — in 
turn reducing the bandwidth requirements. 

The major problem with on-chip memory is its limited 
memory capacity. While this limitation does restrict CNAPS 
applications, it has not been a major problem. With early 
applications, the performance/cost advantages of on-chip 
memory have been more important than the memory capacity 
limits. 


Redundancy for Yield Improvement 


During the manufacture of integrated circuits, small de- 
fects and other anomalies occur, causing some circuits to 
malfunction. These defects have a more-or-less random distri- 
bution on a silicon wafer. The larger the chip, the greater the 
probability that at least one defect will occur there during 
manufacturing. The number of good chips per wafer is called 
the yield. As chips get larger, fewer chips fit on a wafer and 
more have defects; yield therefore drops off rapidly with size. 
Since wafer costs are fixed, cost per chip is inversely related to 
the number of good chips per wafer. The result is that bigger chips cost more. 
On the other hand, bigger chips do more, and their ability to 
fit more function into a smaller system makes big chips worth 
more. Semiconductor engineers are constantly pushing the 
limits to maximize both function and yield at the same time. 

One way to build larger chips and maximize yield is to
use redundancy, where many copies of a circuit are built into
the chip. After fabrication, defective circuits are switched out
and replaced with a good copy. Memory designers have used
redundancy for years: extra memory words are fabricated on
the chip and substituted for defective words. With redundancy,
some defects can be tolerated and still yield a fully functional
chip.


Figure 9.


A two-dimensional PN layout. This configuration is often called
a "NEWS" network, since each PN connects to its north, east,
west, and south neighbor. These networks provide more flexible
intercommunication than a one-dimensional network, but
are difficult to make work when redundant PNs are used.

One advantage of building ANN silicon is that each PN 
can be simple and small. In the CNAPS processor array chip, 
the PNs are small enough to be effective as “units of redun- 
dancy.” By fabricating spare PNs, we can significantly im- 
prove yield and reduce the cost per PN. The 1064 has 80 PNs 
(in an 8x10 array), and the 1016 has 20 (4x5). Even with a 
relatively high defect density, the probability of at least 64 out 
of 80 (or 16 out of 20) PNs being fully functional is close to 
1.0. CNAPS is the first commercial processor to make exten- 
sive use of such redundancy to reduce costs. Without redun- 
dancy, the processor array chips would have been smaller and 
less cost-effective. We estimate a CNAPS implementation
using redundancy has about a two-times performance/cost
advantage over one lacking redundancy.

° To change algorithms, the CSC need only branch to a different section of a program.
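
The effect of spare PNs on yield can be illustrated with a simple binomial model. The sketch below is not the authors' yield analysis: it assumes that each PN works independently with the same probability, ignores defects in the shared logic outside the PN array, and uses a made-up per-PN yield of 0.90 purely for illustration. The function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(n_total: int, n_needed: int, p_good: float) -> float:
    """Probability that at least n_needed of n_total independent PNs work."""
    return sum(
        comb(n_total, k) * p_good**k * (1.0 - p_good) ** (n_total - k)
        for k in range(n_needed, n_total + 1)
    )


# Assumed per-PN yield, chosen only for illustration.
p = 0.90
with_spares = prob_at_least(80, 64, p)     # any 64 of 80 PNs must work
without_spares = prob_at_least(64, 64, p)  # hypothetical chip: all 64 of 64 must work
print(f"80 PNs fabricated, 64 needed: {with_spares:.4f}")
print(f"64 PNs fabricated, 64 needed: {without_spares:.6f}")
```

Under these assumptions the chip with spares is usable almost every time, while a chip that needs every PN to work is rarely usable, which is the qualitative point made in the text.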

Redundancy also influenced the decision to use limited- 
precision, fixed-point arithmetic. Our analyses showed that 
floating-point PNs would have been too large to leverage 
redundancy; hence, floating point would have been even more 
expensive than just the size difference (normally about a factor 
of four) indicates. 

Redundancy also influenced the decision to use one-dimensional
inter-PN interconnect. One-dimensional interconnect
makes it relatively easy to implement PN redundancy,
since any 64 of the 80 PNs can be used. Two-dimensional
interconnect complicates redundancy and was not essential for
our applications, so we chose the one-dimensional scheme,
which does not impact the PN redundancy mechanisms.


Limitations 


In retrospect, we are satisfied with the decisions made in 
designing the CNAPS architecture. We have no regrets about the 
major decisions such as the choices of digital, SIMD, limited 
fixed point, broadcast interconnect, and on-chip memory. 

The architecture does have a few minor bottlenecks that 
will be improved in future versions. For example, the 8-bit 
input/output buses should be 16-bit. In line with that, a true 
one-clock 16x16 multiply is needed, as well as better support 
for rounding. And future versions will have higher frequencies 
and more on-chip memory. A hardware based random number 
generator for each PN would also be useful for many ANN 
emulation tasks. Despite these few limitations, the architecture
has been successfully used in several applications with
excellent performance.


Product Realization and Software 


Adaptive Solutions has created a complete development 
software package for CNAPS. It includes a library of important 
ANN algorithms and a C compiler with a library of commonly
used functions. Several board products are now available and 
sold to customers to use for ANN emulation, image and signal 
processing, and pattern recognition applications. 


CNAPS Applications 


This section reviews several CNAPS applications, both
ANN and non-ANN. Some applications mix ANN and non-ANN
techniques. For example, an
application could preprocess and enhance an image via
standard imaging algorithms, then use an ANN classifier on
segments of the image, keeping all data inside the CNAPS
array for all operations.


Back-Propagation 


The most popular ANN algorithm is back-propagation 
(BP) [17]. Although requiring large computational resources 
during training, BP has several advantages that make it a 
valuable algorithm: 


* BP is reasonably generic, meaning that one network
model (emulation program) can be applied to a wide
range of applications with little or no modification;

* its non-linear, multilayer architecture lets it solve
complex problems;

* BP is relatively easy to use and understand; and

* several commercial software vendors have excellent
BP implementations.


It is estimated that over 90% of the ANN applications in 
use today use BP or some variant of it. We therefore felt that 
it was important for CNAPS to execute BP efficiently. This 
section briefly discusses the general implementation of BP on 
CNAPS. For more detail, see McCartor [11]. 





Figure 10. 


The intercommunication pattern of a fast Fourier transform
(FFT): a butterfly intercommunication pattern for four nodes.
This pattern is difficult to do in fewer than N clocks (where N is
the number of nodes) with broadcast intercommunication.


Figure 11.


A back-propagation network with five inputs, four hidden
nodes and two output nodes.

There are two CNAPS implementations of BP, a single-precision
version (BP16) and a double-precision version
(BP32). BP16 uses unsigned 8-bit input and output values
and signed 16-bit weights. The activation function is a
traditional sigmoid, implemented by table lookup. BP32
uses signed 16-bit input and output values and signed 32-bit
weights. The activation function is a hyperbolic tangent
implemented by table lookup for the upper 8 bits and by
linear extrapolation for the lower 8 bits. All values are fixed
point. We have found that BP16 is sufficient for all classification
problems. BP16 has also been sufficient for most
curve fitting problems, such as function prediction, which
have more stringent accuracy requirements. In those cases
where BP16 does not have the accuracy of floating point,
BP32 is as accurate as floating point in all cases studied so
far. The rest of this section focuses on the BP16 algorithm.
It does not discuss the techniques involved in dealing with
limited precision on CNAPS.

BP has two phases. The first is feed-forward operation, 
where the network passes data without updating weights. The 
second is error back-propagation and weight update during 
training. Each phase will be discussed separately. This discussion
assumes that the reader already has a working understanding
of BP.


Back-Propagation: Feedforward Phase


Assume a simple CNAPS system with four PNs and a BP
network with five inputs, four hidden nodes, and two output
nodes (34 total connections, counting a separate bias connection
for each node) (Figure 11). Allocate nodes 0 and 4 to PN0,
nodes 1 and 5 to PN1, node 2 to PN2, and node 3 to PN3. When
a node is allocated to a PN, the local memory of that PN is
loaded with the weight values for each of the node’s connections
and with the lookup table for the sigmoid function. If
learning is to be performed, then each connection requires a
two-byte weight, two bytes to accumulate the weight
deltas, and a two-byte transpose weight (discussed below). This
network then requires 204 bytes for connection information
and 256 bytes for the lookup table. Using momentum (ignored
here for simplicity) would require more bytes per connection.
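
The connection and memory figures quoted above follow from simple arithmetic, reproduced in the short sketch below. The function and constant names are illustrative only and do not come from the CNAPS software.

```python
def connection_count(n_in: int, n_hidden: int, n_out: int) -> int:
    """Connections in a three-layer network, with one bias connection per node."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


# Per-connection storage when learning is enabled:
# 2-byte weight + 2-byte delta accumulator + 2-byte transpose weight.
BYTES_PER_CONNECTION_LEARNING = 2 + 2 + 2
SIGMOID_TABLE_BYTES = 256

connections = connection_count(5, 4, 2)
connection_bytes = connections * BYTES_PER_CONNECTION_LEARNING
print(connections, connection_bytes, SIGMOID_TABLE_BYTES)  # 34 204 256
```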

Each input vector contains five elements. To start the
emulation process, each element of the input vector is read
from an external file by the CSC and broadcast over the Inbus
to all four PNs. Writing w1_{j,i} for the first-layer weight that
connects input i to hidden node j, PN0 performs the multiply
v_0 * w1_{0,0}; PN1, v_0 * w1_{1,0}; etc. This happens in one
clock. In the next clock, v_1 is broadcast, PN0 computes
v_1 * w1_{0,1}, PN1 computes v_1 * w1_{1,1}, etc.
Meanwhile, the previous clock’s products are sent to the adder,
which contains zero initially.

All hidden-layer products have been generated after five
clocks. One more clock is required to add the last product to
the accumulating sum (ignoring the bias terms here for simplicity).
Next, all PNs take the most-significant byte out of the
accumulated sum and use it as an address into the lookup table to get the
sigmoid output. The read value then is put into the output
buffer, and the PNs are ready to compute the output node
outputs.
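
The table-lookup step can be sketched as follows. This is only an illustration of the idea, not the CNAPS table itself: it assumes a signed 16-bit accumulator, treats the high byte as a signed value that spans activations from -8 to just under 8 (the constant `INPUT_RANGE` is an assumption), and scales the sigmoid to an unsigned 8-bit output.

```python
import math

TABLE_SIZE = 256
INPUT_RANGE = 8.0  # assumed: the high byte covers activations in [-8, 8)


def build_sigmoid_table() -> list:
    """Build a 256-entry table indexed by the high byte of the accumulator."""
    table = []
    for address in range(TABLE_SIZE):
        # Interpret the address as a signed 8-bit (two's complement) value.
        signed = address - 256 if address >= 128 else address
        x = signed * INPUT_RANGE / 128.0
        y = 1.0 / (1.0 + math.exp(-x))
        table.append(min(255, round(y * 255)))
    return table


def sigmoid_lookup(table: list, acc16: int) -> int:
    """Map a signed 16-bit accumulated sum to an unsigned 8-bit output."""
    high_byte = (acc16 >> 8) & 0xFF  # most-significant byte used as the address
    return table[high_byte]


table = build_sigmoid_table()
print(sigmoid_lookup(table, -32768), sigmoid_lookup(table, 32767))
```

A table of this size matches the 256 bytes of lookup storage mentioned earlier; the scaling of the address range is a design choice that the source does not specify.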

The next step is computing the output-layer node values
(nodes 4 and 5). In the first clock, PN0 transmits its output
(node 0’s output) onto the output bus. This value goes through
the CSC and comes out on the Input bus, where it is broadcast
to all PNs. Although only PN0 and PN1 are used, all PNs
compute values (PN2 and PN3 compute dummy values). Writing
n_j for the output of hidden node j and w2_{k,j} for the
second-layer weight from hidden node j to output node k, PN0
and PN1 compute n_0 * w2_{4,0} and n_0 * w2_{5,0}. In the next clock,
node 1’s value is broadcast and n_1 * w2_{4,1} and n_1 * w2_{5,1} are
computed, and so on. After four clocks, PN0 and PN1 have
computed all products. One more clock is needed for the last
addition; then, a sigmoid table lookup is performed. Finally,
the node 4 and 5 outputs are transmitted sequentially on the
Outbus, and the CSC writes them into a file.
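
The broadcast scheme described above can be mimicked in a few lines. The sketch below is a software model, not CNAPS code: it counts only broadcast clocks (the extra clock for the final addition is left out), omits bias terms and the sigmoid lookup, and uses made-up integer weights. The name `broadcast_layer` is hypothetical.

```python
def broadcast_layer(inputs, weight_rows):
    """Simulate one layer in which one input element is broadcast per clock.

    weight_rows[pn][i] is the weight that PN number pn holds for input i.
    Returns the per-PN sums and the number of broadcast clocks used.
    """
    sums = [0] * len(weight_rows)
    clocks = 0
    for i, value in enumerate(inputs):
        # Conceptually, every PN does this multiply-accumulate in the same clock.
        for pn, row in enumerate(weight_rows):
            sums[pn] += value * row[i]
        clocks += 1
    return sums, clocks


# Five inputs and four hidden nodes, one per PN; weight values are made up.
inputs = [1, 2, 3, 4, 5]
hidden_weights = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [2, 0, 0, 0, -1],
]
hidden_sums, hidden_clocks = broadcast_layer(inputs, hidden_weights)

# Two output nodes, held on the first two PNs, fed by the four hidden values.
output_weights = [[1, 1, 1, 1], [1, -1, 0, 0]]
output_sums, output_clocks = broadcast_layer(hidden_sums, output_weights)
print(hidden_clocks, output_clocks)
```

The clock count depends only on the number of values broadcast, not on the number of PNs, which is why adding nodes (up to the number of PNs) costs no extra time.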

Let a connection clock be the time it takes to compute one
connection. For standard BP, a connection requires a
multiply-accumulate plus, depending on the architecture, a memory
fetch of the next weight, the computation of that weight’s
address, etc. For the CNAPS PN, a connection clock takes one
cycle. On a commercial microprocessor chip, a connection
clock can require one or more cycles, since many commercial
chips cannot simultaneously execute all operations required to
compute a connection clock: weight fetch, weight address
increment, input element fetch, multiply, and accumulate.
These operations can take up to 10 clocks on many
microprocessors. Much of this overhead is memory fetch, since many
state-of-the-art microprocessors are making more use of several
levels of intermediate data caching. And, as discussed
previously, ANNs are notorious cache busters, so many memory
and input element fetches can take several clocks each.


Figure 12.


A schematicized version of the three-layer LVQ network that
Sharp uses in their Kanji OCR system. The character is presented
as a 16x16 or 256 element system. Some characters
are recognized immediately; others are merely grouped with
similar characters. ©IEEE.

Simulating a three-layer BP network with N_I inputs, N_H
nodes in the hidden layer, and N_O nodes in the output layer will
require (N_I * N_H) + (N_H * N_O) + N_O connection clocks for
non-learning, feed-forward operation on a single processor
system. On CNAPS, assuming there are more PNs than hidden
or output nodes, the same network will require N_I + N_H + N_O
connection clocks. For example, assume that N_I = 256, N_H =
128, and N_O = 64. For a single processor system, the total is
41,024 connection clocks; for CNAPS, 448. If a workstation
takes about four cycles on average, which is typical, to compute
a connection, then CNAPS is roughly 370x faster on this
network.
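
The two connection-clock counts are easy to evaluate for any layer sizes. The sketch below simply codes the two expressions given in the text; the function names and the `cycles_per_connection` parameter are illustrative, and the four-cycle figure is the workstation estimate quoted above.

```python
def serial_connection_clocks(n_i: int, n_h: int, n_o: int) -> int:
    """Connection clocks on a single processor: (N_I * N_H) + (N_H * N_O) + N_O."""
    return n_i * n_h + n_h * n_o + n_o


def cnaps_connection_clocks(n_i: int, n_h: int, n_o: int) -> int:
    """Connection clocks on CNAPS with at least one PN per node: N_I + N_H + N_O."""
    return n_i + n_h + n_o


def speedup(n_i: int, n_h: int, n_o: int, cycles_per_connection: float) -> float:
    """Ratio of serial cycles to CNAPS cycles (one cycle per connection clock)."""
    serial_cycles = serial_connection_clocks(n_i, n_h, n_o) * cycles_per_connection
    return serial_cycles / cnaps_connection_clocks(n_i, n_h, n_o)


print(serial_connection_clocks(256, 128, 64))
print(cnaps_connection_clocks(256, 128, 64))
print(round(speedup(256, 128, 64, cycles_per_connection=4)))
```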


Back-Propagation: Learning Phase


The second and more complex aspect of BP learning is
computing the weight delta for each connection. A detailed
discussion of this computation and its CNAPS implementation
is beyond the scope of this paper, so only a brief overview is
given here. The computation is more-or-less the same as a
sequential implementation. The basic learning operation in BP
is to compute an error signal for each node. The error signal is
proportional to that node’s contribution to the output error (the
difference between the target output vector and the actual
output vector). From the error signal, a node can then compute
how to update its weights. At the output layer, the error signal
is the difference between the feed-forward output vector and
the target output vector for that training vector. The output
nodes can compute their error signals in parallel.

The next step is to compute the delta for each output
node’s input weight (the hidden-to-output weights). This computation
can be done in parallel, with each PN computing,
sequentially, the deltas for all weights of the output node
it holds. If a batching algorithm is used, then the deltas are
added to a data element associated with each weight. After
the deltas for several training vectors have been accumulated,
the weights are updated according to the accumulated delta.
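
Batched updating amounts to keeping one accumulator per weight, as the following sketch shows for the weights held by a single PN. The function names and the sample delta values are illustrative; how the deltas themselves are computed is outside the scope of this fragment.

```python
def accumulate_deltas(accumulators: list, deltas: list) -> None:
    """Add the deltas from one training vector to the per-weight accumulators."""
    for index, delta in enumerate(deltas):
        accumulators[index] += delta


def apply_batch(weights: list, accumulators: list) -> None:
    """Update each weight from its accumulated delta, then clear the accumulator."""
    for index in range(len(weights)):
        weights[index] += accumulators[index]
        accumulators[index] = 0


weights = [10, 20]
accumulators = [0, 0]
accumulate_deltas(accumulators, [1, -2])  # deltas from a first training vector
accumulate_deltas(accumulators, [3, 4])   # deltas from a second training vector
apply_batch(weights, accumulators)
print(weights, accumulators)  # [14, 22] [0, 0]
```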

The next step is to compute the error signals for the 
hidden-layer nodes, which requires a multiply-accumulate of 
the output-node error signals through the output-node weights. 
Unfortunately, the output-layer weights are in the wrong place 
(on the output PNs) for computing the hidden-layer errors. 
That is, the hidden nodes need weights that are scattered 
among the output PNs, which can best be represented as a 
transpose of the weight matrix for that layer. A transpose
operation is slow on CNAPS, taking O(N^2) operations. The
easiest solution was to maintain two weight matrices for each
layer, the feed-forward version and a transposed version for
the error back-propagation. This requires twice the weight
memory for each hidden node, but permits error propagation
to be parallel, not serial. Although the new weight value need
only be computed once, it must be written to two places. This 
duplicate weight matrix is required only if learning is to be 
performed. 
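
The role of the transposed copy can be seen in a small sketch. It shows only the multiply-accumulate of output errors through the weights and the double write on update; the multiplication by the derivative of the activation function, which standard BP also requires, is left out. All names and values are illustrative.

```python
def hidden_errors(w_transposed: list, output_errors: list) -> list:
    """Back-propagate output errors; w_transposed[j][k] links hidden j to output k.

    Row j holds everything hidden node j needs, so each row can be processed
    by a different PN in parallel.
    """
    return [
        sum(w_jk * e_k for w_jk, e_k in zip(row, output_errors))
        for row in w_transposed
    ]


def update_weight(w_forward: list, w_transposed: list, k: int, j: int, delta) -> None:
    """Compute the new weight once, then write it to both copies."""
    new_value = w_forward[k][j] + delta
    w_forward[k][j] = new_value
    w_transposed[j][k] = new_value


# Two output nodes, four hidden nodes; values are made up.
w_forward = [[1, 2, 3, 4], [5, 6, 7, 8]]
w_transposed = [list(column) for column in zip(*w_forward)]
print(hidden_errors(w_transposed, [1, -1]))
update_weight(w_forward, w_transposed, k=1, j=2, delta=10)
```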

After the hidden-layer error signals have been computed, 
the weight delta computation can proceed exactly as described 
above. If more than one hidden layer is used, then the entire 
process is repeated for the second hidden layer. The input layer 
does not require the error signal. 

For non-batched weight update, where the weights are 
updated after the presentation of each vector, the learning 
overhead requires about five times more cycles than feed-for- 
ward execution. A 256 PN (four chip) system with all PNs busy 
can update about one billion connections per second, almost 
one thousand times faster than a Sparc2 workstation. A BP
network that takes an hour on a Sparc2 takes only a few
seconds on CNAPS.


Simple Image Processing 


One major goal of CNAPS was flexibility because, by
Amdahl’s law, the more of the problem that can be parallelized,
the better. Therefore, other parallelizable, but non-ANN, parts
of the problem should also be moved to CNAPS where possible.
Many imaging applications, including OCR programs,
require image processing before turning the ANN classifier
loose on the data. A common image processing operation is
convolution by spatial filtering.


Figure 13.


Distinguishing members of a group by focusing on a group-specific
subfield. Here a more detailed 32 x 32 image is used.
©IEEE.

Using spatial (pixel) filters to enhance an image requires 
more complex computations than simple pixel operations re- 
quire. Convolution, for example, is a common operation per- 
formed during feature extraction to filter spatial noise or define 
edges. Here, a kernel, an M by M dimensional matrix, is 
convolved over an image. In the equation below, for instance, 
the local kernel, k, is convolved over an N by N image, a, to 
produce a filtered N by N image b: 


$$b_{i,j} = \sum_{p,q} k_{p,q} \, a_{i-p,\,j-q} \qquad (1 \le i,j < N) \quad (1 \le p,q < M)$$
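
The convolution defined above can be written directly as nested loops. The sketch below is a plain sequential version for reference, not the CNAPS implementation; it uses 0-based indices over the full image and kernel, and it assumes that pixels outside the image count as zero, a border rule the text does not specify.

```python
def convolve(image: list, kernel: list) -> list:
    """Compute b[i][j] = sum over p, q of kernel[p][q] * image[i - p][j - q]."""
    n_rows, n_cols = len(image), len(image[0])
    m = len(kernel)
    result = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            acc = 0
            for p in range(m):
                for q in range(m):
                    r, c = i - p, j - q
                    # Assumed border rule: out-of-range pixels contribute zero.
                    if 0 <= r < n_rows and 0 <= c < n_cols:
                        acc += kernel[p][q] * image[r][c]
            result[i][j] = acc
    return result


image = [[1, 2], [3, 4]]
shift_kernel = [[0, 0], [0, 1]]  # picks out image[i - 1][j - 1]
print(convolve(image, shift_kernel))  # [[0, 0], [0, 1]]
```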


Typical convolution kernels are Gaussian, differences of
Gaussian, and Laplacian filters. Due to their inherent parallelism,
convolution algorithms can be easily mapped to the
CNAPS architecture. The image to be filtered is divided into
regions called “tiles”. One or more tiles are mapped to each
PN. The kernel values, k, are then broadcast to the array. If the
image does not fit in the PN array, then only a subset of the
tiles are moved in and computed at a time.

There are two ways to deal with the fact that the convolu- 
tions will overlap at the edges of tiles: send in overlapping 
tiles, or send values in tile edges to neighboring PNs (that have 
neighboring tiles) via the inter-PN bus before each convolu- 
tion operation. The first method is used if only one convolution 
is performed on a tile before outputting it again. This method 
is used by our Photoshop accelerator product, which will be
discussed below. The second method is used by general image 
processing routines that do several operations on an image 
before it is output. 
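
The first method, overlapping tiles, can be checked with a small sketch. For simplicity it tiles the image into horizontal strips only, and it assumes the same zero-border convolution as a plain sequential version; with the kernel indexed as k[p][q] applied to a[i-p][j-q], each strip needs M-1 extra rows above it. The names are illustrative and none of this is the CNAPS tiling code.

```python
def convolve(image, kernel):
    """Zero-border convolution: b[i][j] = sum of kernel[p][q] * image[i-p][j-q]."""
    n_rows, n_cols, m = len(image), len(image[0]), len(kernel)
    result = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for p in range(m):
                for q in range(m):
                    if 0 <= i - p < n_rows and 0 <= j - q < n_cols:
                        result[i][j] += kernel[p][q] * image[i - p][j - q]
    return result


def convolve_in_strips(image, kernel, strip_rows):
    """Filter the image one strip at a time, sending in overlapping strips."""
    halo = len(kernel) - 1  # extra rows each strip needs from the strip above
    result = []
    for top in range(0, len(image), strip_rows):
        start = max(0, top - halo)
        strip = image[start:top + strip_rows]   # strip plus its overlap rows
        filtered = convolve(strip, kernel)
        result.extend(filtered[top - start:])   # keep only the strip's own rows
    return result


image = [[(r * 7 + c * 3) % 11 for c in range(5)] for r in range(7)]
kernel = [[1, 2, 1], [0, 1, 0], [-1, 0, 2]]
print(convolve_in_strips(image, kernel, strip_rows=2) == convolve(image, kernel))
```

The overlap rows are transferred twice, which is the price of avoiding any communication between neighboring tiles.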

For neural networks, the problem data is the same for each
node in the network and the coefficients (weights) are different.
This led to allocating the coefficients to the PNs and broadcasting
the data. However, because of the sparseness of the
convolution, it is more efficient to put the data (the image tiles)
on the PNs and then broadcast the coefficients.

Because of the parallel structure of this algorithm, all PNs
can calculate the convolution kernel at the same time, convolving
all pixels in one row simultaneously. Using different
kernels, this convolution process can be carried out several
times, each time with a different type of spatial filtering
performed on the image.

For an 18 MB image in full color, RGB 3-byte per pixel
representation, a 7x7 convolution kernel can be applied by
a single 64-PN chip in about 7 seconds. This includes the time
to move the image onto the CNAPS board and off again.


Naval Air Warfare Center 


At the Naval Air Warfare Center (NAWC) at China Lake, 
California, ANN technology has been aimed at air-launched 
tactical missiles. Processing sensor information on board these 
missiles demands a computational density (operations per
second per cubic inch) far above most applications. Tactical
missiles typically have several high-data-rate sensors, each
with its own separate requirements for high-speed processing.
The separate data must then be fused, and the physical opera- 
tion of the missile controlled. All this must be done under 
millisecond or microsecond time constraints and in a volume 
of a few cubic inches. Available power is measured in tens of 
watts. Such immense demands have driven NAWC re- 
searchers toward ANN technology. 

For some time (1986 to 1991), many believed that analog 
hardware was the only way to achieve the required computa- 
tional density. The emergence of wafer scale, parallel digital 
processing (exemplified by the CNAPS chip) has changed that 
assessment, however. With this chip, we have crossed the
threshold at which digital hardware, with all its attendant
flexibility advantages, has the computational density needed
to be useful in the tactical missile environment. Analog VLSI
may still be the only way to overcome some of the most acute 
time-critical processing problems on board the missile, for 
example, at the front end of an image processing system. A 
hybrid system combining the best of both types of chips may 
easily turn out to be the best solution. 

Researchers at NAWC have worked with several versions 
of the CNAPS system. They have easily implemented cortico- 
morphic computational structures on this system — structures 
that were difficult or impossible under the analog constraints 
of previous systems. They have also worked with Adaptive 
Solutions to design and implement a multiple-controller 
CNAPS system (a multiple SIMD architecture or MSIMD) 
with high-speed data-transfer paths between the subsystems. 
And they are completing the design and fabrication of a 
real-time system interfaced to actual missile hardware. The 
current iteration will be of the SIMD form, but the follow-on 
will have the new MSIMD structure. 

The prototype system, called MAVIS (Missile borne Ar- 
tificial Vision System), is accurate and effective, but is also 
compute intensive. As input, the system takes images of 
128x128 pixels, with 8 bits per pixel, at 60 images per second. 
The algorithm discussed above requires about 12,000 opera- 
tions per pixel per frame or 11 billion operations per second. 
This system must fit inside a missile and consume only a 
moderate amount of power. This implementation uses three 
256 PN CNAPS VME cards and operates in real time. With 
the current CNAPS implementation technology, it would be 
possible to put the system in the target platform by using a 
Multi-Chip Module with a ceramic substrate. 
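
The throughput figures can be cross-checked with a few lines of arithmetic. The sketch below uses the rounded inputs quoted in the text, so the result is only an order-of-magnitude check (a little under 12 billion operations per second with 12,000 operations per pixel). The even split across PNs is an assumption made for illustration; the text does not say how the work is distributed.

```python
width, height = 128, 128     # pixels per image
frames_per_second = 60
ops_per_pixel = 12_000       # rounded figure quoted in the text
total_pns = 3 * 256          # three 256-PN CNAPS VME cards

pixels_per_second = width * height * frames_per_second
ops_per_second = pixels_per_second * ops_per_pixel

# Assumption: the work is spread evenly over all PNs.
ops_per_pn_per_second = ops_per_second / total_pns

print(f"{pixels_per_second:,} pixels per second")
print(f"{ops_per_second:,} operations per second")
print(f"{ops_per_pn_per_second:,.0f} operations per second per PN")
```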

One important near-term application of this computa- 
tional structure is in the area of adaptive, non-uniformity 
compensation for staring focal plane arrays. It appears also that 
this structure will allow the implementation of three dimen- 
sional wavelet transforms where the third dimension is time. 


Lynch/Granger Pyriform Implementation 


Researchers Gary Lynch and Richard Granger at the Uni- 
versity of California, Irvine have produced an ANN model 
based on their studies of the Pyriform cortex of the rat. The 
algorithm contains features abstracted from actual biological 
operation, and has been implemented on the CNAPS parallel 
computer [13]. 

The algorithm contains both parallel and serial elements, 
and lends itself well to execution on CNAPS. Clusters of 
competing neurons, called “patches” or “subnets,” hierarchi- 
cally classify inputs by first competing for the greatest activa- 
tion within each patch, then subtracting the most prominent 
features from the input as it proceeds down the lateral olfactory 
tract (the LOT, the primary input channel) to subsequent 
patches. Patch activation and competition occur in parallel in 
the CNAPS implementation. A renormalization function
analogous to the automatic gain control performed in Pyriform
cortex also occurs in parallel across competing PNs in the
CNAPS array.

Transmission of LOT input from patch-to-patch is an
inherently serial element of the Pyriform model, so opportunities
for parallel execution for this part of the model are few.
Nevertheless, overall speedups for execution on CNAPS 
(compared to execution on a serial machine) of 50 to 200 times 
are possible, depending on network dimensions. 

Refinements of the Pyriform model and its application
to diverse pattern recognition problems continue.


Sharp Kanji 


Another application that has successfully used ANNs and 
the CNAPS system is a Kanji optical character recognition 
(OCR) system developed by the Sharp Corporation of Japan. 
In OCR, a page of printed text is scanned to produce a bit 
pattern of the entire image. The OCR program’s task is to 
convert the bit pattern of each character into a computer 
representation of the character. In the US and Europe, the most 
common representation of Latin characters is the 8-bit ASCII 
code. In Japan, because of its unique writing system, it is the
16-bit JIS code.

OCR requires a complex set of image recognition opera- 
tions. Many companies have found that ANNs are effective for 
OCR because ANNs are powerful classifiers. Many commer- 
cial OCR companies, such as Caere, Calera, Expervision, and 
Mimetics, use ANN classifiers as a part of their application 
software. 

Japanese OCR is much more difficult than English OCR 
because Japanese has a larger character set. Written Japanese 
has two basic alphabets. The first is Kanji, or pictorial charac- 
ters borrowed from China. Japanese has tens of thousands of 
Kanji characters, although it is possible to manage reasonably 
well with about 3500 characters. Sharp chose these basic Kanji 
characters for their recognizer. 

The second alphabet is Kana, comprised of two phonetic 
alphabets (Hiragana and Katakana) having 53 characters each. 
Typical written Japanese mixes Kanji and Kana. Written Japanese
also employs Arabic numerals and Latin characters, typically
found in business and newspaper writing. A commercial
OCR system must be able to identify all four types of charac- 
ters. To add further complexity, any character can appear in 
several different fonts. 

Japanese keyboards are difficult to use, so a much smaller 
proportion of business documentation than one sees in the 
United States and other western countries is in a computer 
readable form. This difficulty creates a great demand for the 
ability to accurately read printed Japanese text and convert it 
to the corresponding JIS code automatically. Unfortunately, 
due to the large alphabet, the computer recognition of written 
Japanese is a daunting task. At the time this paper is being
written, the commercial market consists of slow (10-50 characters
per second), expensive (tens of thousands of dollars),
and marginally accurate (96%) systems. Providing high speed
and accuracy for a reasonable price would be a quantum leap
in capability in the current market.

Sharp Corporation and Mitsubishi Electric Corporation 
have both built prototype Japanese recognition systems based 
on the CNAPS architecture. Both systems recognize a total of 
about 4000 characters in 50 different fonts at accuracies of 
over 99% and speeds of several hundred characters per second. 
These applications have not yet been released as commercial 
products. 

Sharp’s system uses a hierarchical three-layer network
[8, 20] (Figures 12 and 13). Each layer is based on
Kohonen’s Learning Vector Quantization (LVQ) algorithm, a 
Bayesian approximation that shifts the node boundaries to 
maximize the number of correct classifications. In Sharp’s 
system, unlike back-propagation, each hidden-layer node rep- 
resents a character class, and some classes are assigned to 
several nodes. Ambiguous characters pass to the next layer. 
When any layer unambiguously classifies a character, it has 
been identified, and the system moves on to the next character. 
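
For readers unfamiliar with LVQ, the basic training step can be sketched as follows. This is the textbook LVQ1 rule, in which the nearest prototype is pulled toward a correctly labeled input and pushed away from an incorrectly labeled one; the source does not say which LVQ variant or which parameters Sharp used, so the learning rate, data layout, and function names here are assumptions.

```python
def nearest(prototypes: list, x: list) -> int:
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(prototypes[i][0], x)),
    )


def lvq1_step(prototypes: list, x: list, label, rate: float = 0.1) -> int:
    """One LVQ1 update; prototypes is a list of [vector, class_label] pairs."""
    i = nearest(prototypes, x)
    vector, proto_label = prototypes[i]
    # Move toward x if the winner has the right class, away from x otherwise.
    sign = 1.0 if proto_label == label else -1.0
    prototypes[i][0] = [w + sign * rate * (v - w) for w, v in zip(vector, x)]
    return i


prototypes = [[[0.0, 0.0], "a"], [[1.0, 1.0], "b"]]
lvq1_step(prototypes, [0.2, 0.0], "a", rate=0.5)  # winner "a" moves toward x
lvq1_step(prototypes, [0.9, 1.0], "a", rate=0.5)  # winner "b" moves away from x
print(prototypes)
```

Because several prototypes may share a class label, a class can be represented by more than one node, as in the system described above.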

The first two levels take as input a 16x16 pixel image (256 
elements). With some exceptions, these layers classify the 
character into multiple subcategories. The third level has a 
separate network per subcategory. It uses a high-resolution 
32x32 pixel image (1024 elements), focusing on the subareas 
of the image known to have the greatest differences among 
characters belonging to the subcategory. These subareas of the 
image are trained to tolerate reasonable spatial shifting without 
sacrificing accuracy. Such shift tolerance is essential due to 
differences among fonts and shifting during scanning. 

Sharp’s engineers clustered 3303 characters into 893 sub- 
categories containing similar characters. The use of subcate- 
gories let Sharp build and train several small networks instead 
of one large network. Each small network took its input from 
several local receptive fields designed to look for particular 
features. The locations of these fields were chosen automatically
during training to maximize discriminative information.
The target features are applied to several positions within each
receptive field, enhancing the shift tolerance of the field.

On a database of scanned characters that included more
than 26 fonts, Sharp reported an accuracy of 99.92% on the 13
fonts used for training and 99.01% on the 13 fonts used for
testing. These results show the
generalization capabilities of this network.


Photoshop Acceleration 


The “Prepress” market segment involves activities that
occur between the development of an electronic document and 
its preparation for printing. The most complex preparation 
tasks concern photographs, figures, and other kinds of complex
images in the document. The most popular program for
manipulating photographs is Adobe Photoshop.

Most commercial images (in magazines, advertising brochures,
and similar publications) are large, typically
tens of megabytes. Simple image processing functions, such
as a 2D spatial filter, can be slow, even on high speed computers.
Because the performance/cost of CNAPS exceeds that of
traditional desktop computers, Adaptive Solutions decided to
build a Photoshop CNAPS accelerator card, “PowerShop.”
The small, simple CNAPS PN arrays and the use of redundancy
allow us to offer such performance at an affordable
price. In addition, these images typically use low-precision
integer data representations, which also map efficiently to
CNAPS.

Photoshop uses several filters. These filters are imple- 
mented like the convolution discussed above. Since Photoshop 
updates the displayed version of the image in the host memory 
after each operation, the CNAPS card reads an image, a single 
operation is performed, then the image is written back to main 
memory. For this reason a two-level tiled version of the con- 
volution algorithm is used. The image is broken up into large 
tiles, each the size of a PN array. CNAPS then processes one 
tile at a time, sequentially reading a tile, then writing an 
updated version of the tile back to main memory. 

In general, the performance improvements of the CNAPS
PowerShop card over a Power Mac (PCI bus based) range from
3x to 10x. A 7x7 convolution filter over a 24-bit full
color 18 MB image using a 64 PN array takes about 7 seconds,
versus 89 seconds on a PowerMac 8100.


Medical Image Processing 


An important area of image processing and pattern recognition
concerns the classification of medical images, a field that
has significant computing requirements and increasing pressure
to decrease costs. Reading and analyzing scanned images
(whether MRI scans, optical scans such as Pap smears, or
X-rays) for suspicious structures such as cancer cells is a matter
of life or death. The data is noisy and ambiguous, and the
analysis is error prone. R2 Technology has developed a neural network based
classification algorithm for identifying areas of interest in
mammograms.

The R2 application uses a combination of standard image 
processing techniques for image preprocessing and then a neural 
network algorithm for the final classification. The CNAPS PCI 
board with 128 PNs can scan an entire 4K x 4K X-ray in 14 
seconds, which meets R2’s performance requirements.


Conclusion 


This paper has given only a brief view into a commercial
ANN product and into the decisions made during its design. It
has also briefly examined some real applications that use this
product. The reader should have a better idea about why the
various design decisions were made during this process and
the final outcome of this effort. The CNAPS system has
achieved its goals in speed and performance and, as discussed,
is finding its way into real world applications.


Research Notes


An Intelligent Chip 


The fastest neural network processor, the Ni1000 computer
chip, was developed recently by Intel Corp. of Santa
Clara, California. The new technology is the most promising
approach toward building intelligent machines that mimic
hearing, seeing and thinking. The chip is amazingly quick at
recognizing handwriting, identifying military targets and performing
other tasks that are difficult or impossible for conventional
chips. There are numerous civilian applications for this
chip, including fingerprint identification, automatic mailing
address processing, and even stock market forecasting and
prediction.

The new Ni1000 chip, developed with funding from the
Advanced Research Projects Agency (ARPA) and the Office
of Naval Research (ONR), belongs to an unusual breed of chips
called neural networks. These chips work more like the human
brain than the microprocessors used in millions of personal
computers. Because they can recognize visual or sound patterns
at high speed, neural nets are being applied to tricky tasks
such as distinguishing human voices and reading zip codes. ARPA
is interested in these chips for identifying submarines and
other targets.

Intel’s new chip is expected to be particularly useful in
handwriting recognition, a rapidly growing market, and will
be a hundred times faster than other technologies used for that
purpose. Nestor Inc. of Rhode Island developed a version of
the handwriting-recognition algorithm for the Ni1000. A scanner
based on a fast version of Intel’s 486 microchip can
recognize about 30 handwritten characters per second, while
the Ni1000 is expected to recognize 5,000 to 10,000 characters
per second. Although the chip requires too much electric
current for use in small computers, Intel is working to improve
the chip for use in hand-held machines.
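
As a rough check on the recognition rates quoted above, the ratio between the two can be worked out directly. The snippet below is only illustrative arithmetic on the figures in the text; the constant and function names are ours, not Intel's or Nestor's.

```python
# Recognition rates quoted in the text, in characters per second.
I486_RATE = 30
NI1000_RATE_LOW = 5_000
NI1000_RATE_HIGH = 10_000


def speedup(new_rate, old_rate):
    """Return how many times faster new_rate is than old_rate."""
    return new_rate / old_rate


low = speedup(NI1000_RATE_LOW, I486_RATE)
high = speedup(NI1000_RATE_HIGH, I486_RATE)
print(f"Speedup over the 486-based scanner: {low:.0f}x to {high:.0f}x")
```

On these figures the expected gain is roughly 167 to 333 times, the same order of magnitude as the hundredfold improvement cited.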

Where other chips answer precise mathematical ques- 
tions, neural net chips can be trained to work on more subjec- 
tive problems. Interconnected processing elements on each 
chip, called neurons, join in different ways when exposed to 
different signals. By employing a large number of processing 
elements that operate in parallel, the Ni1000 performs 20
billion interconnection operations per second. The chip uses a 
large block of flash memory so that learned patterns can be 
“memorized” and quickly “recalled” for real-time pattern 
recognition applications. Learning capability is implemented 
on-chip in the form of a 16-bit microcontroller. 
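
The idea of "memorizing" learned patterns and "recalling" them for recognition can be sketched in a few lines. The text does not describe the chip's internal algorithm, so the sketch below assumes a simple nearest-prototype scheme with a city-block distance; the class name, method names, and toy patterns are all hypothetical and are not taken from the Ni1000 design.

```python
class PrototypeMemory:
    """Toy store-and-recall classifier (illustrative; not the Ni1000 design)."""

    def __init__(self):
        self.prototypes = []  # list of (feature_vector, label) pairs

    def memorize(self, vector, label):
        """Store a learned pattern together with its class label."""
        self.prototypes.append((tuple(vector), label))

    def recall(self, vector):
        """Return the label of the stored pattern closest to `vector`."""
        if not self.prototypes:
            raise ValueError("no patterns have been memorized")

        def city_block(stored):
            # Sum of absolute differences, one term per input component.
            return sum(abs(a - b) for a, b in zip(stored, vector))

        _, best_label = min(self.prototypes, key=lambda pl: city_block(pl[0]))
        return best_label


memory = PrototypeMemory()
memory.memorize([0, 0, 1, 1], "A")
memory.memorize([1, 1, 0, 0], "B")
print(memory.recall([0, 1, 1, 1]))  # closest to the "A" prototype
```

In a scheme like this every stored pattern can be compared with the input independently, which is the kind of work that many processing elements operating in parallel are well suited to.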

“The Ni1000 chip represents a new generation of highly
intelligent, high performance chips based on the neural network
computations paradigm,” said Dr. Clifford Lau, acting
director of the Electronics Division at ONR and the scientific
officer overseeing ONR’s participation in the chip development.

ONR has a long history of supporting neural networks 
research. In the 1950’s ONR funded the research of F. Rosen- 
blatt on the perceptron, which is now the basic processing 
element of multilayer perceptron neural networks. In the 
1960’s, ONR supported the research of Professor B. Widrow 
at Stanford University on the adaptive linear neuron, or 
ADALINE, together with the least mean square adaptation 
algorithm, which now forms the basis for the popular back 
propagation learning algorithm in artificial neural networks. 
ONR recognized in the 1980’s the importance of under- 
standing how the brain processes and stores information, and 
started to invest in research on learning and memory. The 
objectives of ONR’s programs today, according to Dr. Lau, are
“to understand the architectures of the brain and the algorithms
for brain information processing, and to formulate
computational neuroscience models.”
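
For readers unfamiliar with the least mean square rule mentioned above, a minimal sketch of a single adaptive linear neuron follows. It is a textbook-style illustration under our own assumptions (learning rate, number of epochs, and a toy logical-AND data set), not a reconstruction of Widrow's original work, and it shows only the single-unit rule rather than back propagation.

```python
def train_adaline(samples, learning_rate=0.1, epochs=50):
    """Fit weights with the least mean square (Widrow-Hoff) rule.

    samples: list of (inputs, target) pairs; a bias input of 1.0 is appended.
    """
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)  # last weight is the bias
    for _ in range(epochs):
        for inputs, target in samples:
            x = list(inputs) + [1.0]
            output = sum(w * xi for w, xi in zip(weights, x))  # linear output
            error = target - output
            # LMS step: move each weight in proportion to error * input.
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
    return weights


def predict(weights, inputs):
    """Threshold the linear output to a +1 / -1 decision."""
    x = list(inputs) + [1.0]
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1


# Hypothetical toy data: logical AND with bipolar (+1 / -1) coding.
data = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
w = train_adaline(data)
print([predict(w, x) for x, _ in data])
```

The weight change is driven by the error of the linear output before thresholding, which is what distinguishes this rule from the perceptron's update.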

Dr. Leon Cooper, a long time ONR principal investigator
and Nobel laureate, said, “Combining neural networks that
learn and capture the human ability for rapid pattern recognition
with the processing power of personal computers will
bring us to the next generation of decision-making machines.”


A Bionic Eye

“A computer-packed bionic eye may soon match the 
sensitivity of human and animal eyes,” say Professor Leon O. 
Chua of the University of California at Berkeley and Professor 
Tomas Roska of the Hungarian Academy of Sciences at Bu- 
dapest. Their research program, which is another case of 
science trying to imitate nature, is sponsored by the National 
Science Foundation and the Office of Naval Research. At ONR
the Scientific Officers overseeing the work are Dr. Clifford 
Lau of the Systems and Electromagnetic Theory Division and 
Dr. Joel Davis of the Computational Neuroscience Division. 

This bionic eye is part of the popular trend of combining 
the disciplines of biologists and computer scientists to endow 
machines with intelligence and senses. The scientists are 
working toward a supercomputer etched into a thumbnail-size 
chip on which an image is focused. The computer-eye will be 
made to see like a cat, salamander, hawk, or person. The nerve 
circuitry will be able to pick out some things and pay less 
attention to others. 

Eyes include layers of densely packed neurons, literal
extensions of the brain, that sort the image on the retina into
lines, corners, shades of color and gray, edges and moving
objects even before the image goes to the brain for more
abstract analysis. “The thing is, eyes don’t work like a camera,
just focusing and recording images,” said Professor Frank
Werblin of Berkeley, who has recently joined the bionic eye
team. Werblin is known for his pioneering work on the densely
packed nerves in the retinas of salamanders and uses a
conventional computer the size of a refrigerator to mimic the
vision in a salamander’s eye. The bionic eye project hopes to
put the entire computer on a chip of silicon smaller than a
thumbnail but with the potential calculating speed of the
biggest supercomputers: a trillion calculations per second.

The roots of the bionic eye go back to a tiny computer-on-a-chip
called the cellular neural network (CNN), which Chua
developed and is now refining with funding from ONR under
the supervision of Dr. Rabinder Madan of the Systems and
Electromagnetic Theory Division. Chua has applied for a
patent, naming the device the “CNN Universal Machine.” The
CNN contains hundreds to thousands of interconnected, identical
“cells,” each one a simple bit of circuitry connected to its
nine closest neighbors. When the network is properly set up,
each cell knows directly only what it and its neighbors see,
but with all the cells cooperating and calculating at once,
information surges back and forth at lightning speed. If
programmed to see only straight lines shorter than a certain
length, such a device would immediately spot flaws in woven
fabric.
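
How purely local connections can yield a global result is easier to see in a small simulation. The sketch below is ours and rests on several assumptions not stated in the text: each cell looks at a 3 x 3 neighborhood (itself plus the eight cells around it), the network is advanced with simple Euler steps, cells beyond the border are treated as white, and the weights are a commonly used edge-extraction pattern rather than anything taken from Chua's device.

```python
def simulate_cnn(image, feedback, control, bias, steps=100, dt=0.1):
    """Euler-integrate a small cellular neural network.

    image: 2-D list of inputs, +1 for black and -1 for white.
    feedback, control: 3x3 weight templates shared by every cell.
    """
    rows, cols = len(image), len(image[0])
    state = [[0.0] * cols for _ in range(rows)]

    def output(x):
        # Piecewise-linear saturation: clamp the state to [-1, 1].
        return max(-1.0, min(1.0, x))

    def at(grid, r, c, default):
        # Cells beyond the border are treated as a fixed value.
        return grid[r][c] if 0 <= r < rows and 0 <= c < cols else default

    for _ in range(steps):
        # All cells update at once from the previous step's outputs.
        outputs = [[output(x) for x in row] for row in state]
        new_state = []
        for r in range(rows):
            new_row = []
            for c in range(cols):
                total = bias
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        y = at(outputs, r + dr, c + dc, 0.0)
                        u = at(image, r + dr, c + dc, -1.0)
                        total += feedback[dr + 1][dc + 1] * y
                        total += control[dr + 1][dc + 1] * u
                new_row.append(state[r][c] + dt * (-state[r][c] + total))
            new_state.append(new_row)
        state = new_state
    return [[1 if output(x) > 0 else -1 for x in row] for row in state]


# Assumed edge-extraction weights (illustrative values).
FEEDBACK = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
CONTROL = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
BIAS = -1.0

square = [
    [-1, -1, -1, -1, -1],
    [-1,  1,  1,  1, -1],
    [-1,  1,  1,  1, -1],
    [-1,  1,  1,  1, -1],
    [-1, -1, -1, -1, -1],
]
edges = simulate_cnn(square, FEEDBACK, CONTROL, BIAS)
for row in edges:
    print("".join("#" if v == 1 else "." for v in row))
```

With these weights the filled square is reduced to its outline: the single interior cell, whose neighbors are all black, turns white, while the eight cells on the boundary stay black. No cell ever sees more than its immediate surroundings.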

Some of the layers in salamander retinas see only moving
objects, others see objects only of certain sizes, and others look
for edges. In frogs, the system blinds the animals to almost
everything except objects of the same size and motion as the
flies and moths they snare with their tongues. For salamanders
and most vertebrates, eyes work the same way. The presorted
images get to the brain’s visual cortex, where another dozen or
so layers of brain tissue further break images down.
A single neural network, however, could take the place of all 
the layers of neurons, switching rapidly from one mode to 
another. 

Way down the road, perhaps many decades from now, 
there could be truly bionic eyes as on TV shows, where badly 
injured people are turned into part-machine superheroes. Ar- 
tificial eyes could let the blind see by hooking directly into 
optic nerves or even brains, but nobody knows how to do that 
yet. 


Scientists boldly predict that the computer-eye might soon
recognize wanted criminals or lost children, instantly
detect flaws in manufactured goods, identify targets for
automated military weapons or recognize mineral deposits from
space.

The bionic eye team is learning the truth of the old saying, 
“Beauty is in the eye of the beholder.” 


A Computer with an IQ 

When Professors Richard Granger and Gary Lynch at the
University of California, Irvine, duplicated brain circuitry
in a computer program six years ago, they had no idea that it
would begin acting like a brain. They mapped the circuits of a
small piece of rat brain and then duplicated the circuits in a
computer program, “just to see what would happen.” Last year,
the Office of Naval Research (ONR) tested the program with
great success for relevant Navy uses, such as recognizing sonar
signals.

Soon after creating the program, Granger and Lynch
started feeding their computer signals, simulations of the
electrical impulses that chemical stimuli create in the brain. Not
too surprisingly, the computer stored memory of the stimuli as
the brain does and could recognize them when it perceived
them again. One night while Lynch was playing with the program,
the computer did a new trick. When it was fed a simulated
“odor,” it not only sent back the recognition signal, it sent back
a preliminary signal as well. Granger and Lynch were amazed
when they realized months later that the preliminary signal
denoted a category, a grouping of similar odors that the computer
had devised all on its own. Without being told to do it, the
computer had grouped all flower smells together and all cheese
smells together. “It had spontaneously reproduced a psychological
process,” Lynch says, “because that’s how the brain circuits
are designed to operate. You and a rat and every mammal do
it without thinking.”
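
The behavior described here, a coarse category response followed by recognition of the individual stimulus, can be imitated with a very simple unsupervised grouping scheme. The sketch below is not the Lynch-Granger program and makes no claim about how their circuit model works; the class, the similarity radius, and the four-number "odor" vectors are hypothetical and chosen only to illustrate grouping without category labels.

```python
def distance(a, b):
    """City-block distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


class CategorizingMemory:
    """Toy memory that groups stimuli without being given category labels."""

    def __init__(self, radius):
        self.radius = radius   # how similar stimuli must be to share a group
        self.categories = []   # each entry: list of (name, vector) members

    def _centroid(self, members):
        vectors = [v for _, v in members]
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def learn(self, name, vector):
        """Store a stimulus in the first existing group that is close enough."""
        for members in self.categories:
            if distance(self._centroid(members), vector) <= self.radius:
                members.append((name, list(vector)))
                return
        self.categories.append([(name, list(vector))])

    def recognize(self, vector):
        """Return (category index, stimulus name): coarse signal, then fine."""
        index, members = min(
            enumerate(self.categories),
            key=lambda item: distance(self._centroid(item[1]), vector),
        )
        name, _ = min(members, key=lambda m: distance(m[1], vector))
        return index, name


memory = CategorizingMemory(radius=5.0)
memory.learn("rose", [9, 8, 1, 0])
memory.learn("magnolia", [8, 9, 0, 1])
memory.learn("cheddar", [1, 0, 9, 8])
memory.learn("brie", [0, 1, 8, 9])
print(len(memory.categories))            # number of groups formed
print(memory.recognize([9, 9, 1, 0]))    # category first, then the item
```

The memory is never told which stimuli are flowers and which are cheeses; the two groups emerge from the similarity of the inputs alone.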

When the computer memorized enough chemical stimuli 
and put them into categories, it performed some sophisticated 
recognition. It detected odors masked by stronger and different 
chemical stimuli. It would say, “that’s roses, and there’s a 
magnolia and some cheddar in there, too.” Once the computer 
was wired like a brain, it acted like a brain. You could not 
instruct it to record or sort odors. You could only present it 
with stimuli and let it do what it pleased with them. 

A human brain has 10 billion brain cells or neurons, but 
the computer has only 1,000 simulated neurons. Lynch claims 
that the computer learned to recognize 10,000 words. “If we
could build a model with 100,000 neurons, we could have 
taught it a new word every five seconds for 50 years; it would 
be eager for more and categorizing them.” 

When ONR tested this “thinking” computer, it mastered 
difficult classifications which involved real ocean passive 
acoustic signals; the computer recognized 95 percent of the 
signals and gave no false alarms. The best records before had 
been 25 percent and 60 percent on two other systems. 

Dr. Joel Davis, the ONR scientific officer who has funded 
this work since its inception, says, “The Lynch-Granger pro- 
gram follows biological patterns more closely than any pre- 
vious neural program. It’s strongest where traditional 
programs are weakest — recognizing complex patterns. Be- 
sides classifying sonar signals, it might recognize the vibration 
patterns of mechanical parts about to fail and give warning. 
This field is in its infancy.” 

Perhaps HAL, the thinking, feeling computer in the movie
“2001,” is a possibility for the not too distant future.